Accurate contact predictions for thousands of protein families using PconsC3

Mapping Intimacies ◽

10.1101/079673 ◽

2016 ◽

Cited By ~ 1

Author(s):

Marcin J. Skwark ◽

Mirco Michel ◽

David Menéndez Hurtado ◽

Magnus Ekeberg ◽

Arne Elofsson

Keyword(s):

Structure Prediction ◽

De Novo ◽

Pfam Domain ◽

Three Dimensional ◽

Improved Method ◽

Protein Families ◽

Large Protein ◽

Multiple Sequence ◽

Residue Contact ◽

Contact Predictions

Protein structure prediction was for decades one of the grand unsolved challenges in bioinformatics. A few years ago it was shown that by using a maximum entropy approach to describe couplings between columns in a multiple sequence alignment it was possible to significantly increase the accuracy of residue contact predictions. For very large protein families with more than 1000 effective sequences the accuracy is sufficient to produce accurate models of proteins as well as complexes. Today, for about half of all Pfam domain families no structure is known, but unfortunately most of these families have at most a few hundred members, i.e. are too small for existing contact prediction methods. To extend accurate contact predictions to the thousands of smaller protein families we present PconsC3, an improved method for protein contact predictions that can be used for families with as few as 100 effective sequence members. We estimate that PconsC3 provides accurate contact predictions for up to 4646 Pfam domain families. In addition, PconsC3 significantly outperforms previous methods independent of family size, secondary structure content, contact range, or the number of selected contacts. This improvement translates into improved de-novo prediction of three-dimensional structures. PconsC3 is available as a web server and downloadable version at http://c3.pcons.net. The downloadable version is free for all to use and licensed under the GNU General Public License, version 2.
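The abstract above sizes protein families by their number of "effective sequences" without defining the term. As a rough illustration only, the Python sketch below uses one common convention: each sequence is down-weighted by the number of alignment members that are at least 80% identical to it, and the weights are summed. The 80% threshold, the function names and the toy alignment are assumptions made for this example and are not taken from PconsC3.

```python
def sequence_identity(a, b):
    """Fraction of aligned columns in which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def effective_sequences(alignment, threshold=0.8):
    """Sum of 1/n_i, where n_i counts the sequences (including sequence i
    itself) whose identity to sequence i is at least `threshold`.

    The 0.8 default is an assumed, commonly used cutoff, not a value
    stated in the abstract.
    """
    total = 0.0
    for seq in alignment:
        neighbours = sum(
            1 for other in alignment
            if sequence_identity(seq, other) >= threshold
        )
        total += 1.0 / neighbours
    return total


# Toy alignment (hypothetical): three near-identical rows and one unrelated row.
msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
n_eff = effective_sequences(msa)
print(n_eff)
```

In this toy case the three similar rows together contribute one effective sequence and the unrelated row contributes another, so a family with many redundant members can have far fewer effective sequences than raw members.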

Faculty Opinions recommendation of De novo prediction of three-dimensional structures for major protein families.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1009270.123258 ◽

2002 ◽

Author(s):

Angel Ortiz

Keyword(s):

De Novo ◽

Three Dimensional ◽

Major Protein ◽

Protein Families

FingerprintContacts: Predicting Alternative Conformations of Proteins from Coevolution

10.1101/2020.04.13.037234 ◽

2020 ◽

Author(s):

Jiangyan Feng ◽

Diwakar Shukla

Keyword(s):

Ligand Binding ◽

Structure Prediction ◽

De Novo ◽

Three Dimensional ◽

Sequence Information ◽

Structural Constraints ◽

Complex Signals ◽

Residue Contacts ◽

Small Clusters ◽

Functional Mechanisms

Abstract: Proteins are dynamic molecules which perform diverse molecular functions by adopting different three-dimensional structures. Recent progress in residue-residue contact prediction opens up new avenues for de novo protein structure prediction from sequence information. However, it is still difficult to predict more than one conformation from residue-residue contacts alone. This is due to the inability to deconvolve the complex signals of residue-residue contacts, i.e. spatial contacts relevant for protein folding, conformational diversity, and ligand binding. Here, we introduce a machine learning based method, called FingerprintContacts, for extending the capabilities of residue-residue contacts. This algorithm leverages the features of residue-residue contacts, that is, (1) a single conformation outperforms the others in the structural prediction using all the top ranking residue-residue contacts as structural constraints, and (2) conformation specific contacts rank lower and constitute a small fraction of residue-residue contacts. We demonstrate the capabilities of FingerprintContacts on eight ligand binding proteins with varying conformational motions. Furthermore, FingerprintContacts identifies small clusters of residue-residue contacts which are preferentially located in the dynamically fluctuating regions. With the rapid growth in protein sequence information, we expect FingerprintContacts to be a powerful first step in structural understanding of protein functional mechanisms.

AlignmentViewer: Sequence Analysis of Large Protein Families

10.1101/269720 ◽

2018 ◽

Cited By ~ 1

Author(s):

Roc Reguant ◽

Yevgeniy Antipin ◽

Rob Sheridan ◽

Augustin Luna ◽

Chris Sander

Keyword(s):

Open Source Software ◽

Source Code ◽

Web Browsers ◽

Protein Families ◽

Large Protein ◽

Multiple Sequence ◽

Internet Connection ◽

Visualization Analysis ◽

Link Type ◽

Evolutionary Coupling

Abstract Summary: AlignmentViewer is a multiple sequence alignment viewer for protein families with flexible visualization, analysis tools and links to protein family databases. It is directly accessible in web browsers without the need for software installation, as it is implemented in JavaScript, and does not require an internet connection to function. It can handle protein families with tens of thousands of sequences and is particularly suitable for evolutionary coupling analysis, facilitating the computation of protein 3D structures and the detection of functionally constrained interactions. Availability and Implementation: AlignmentViewer is open source software under the MIT license. The viewer is at http://alignmentviewer.org and the source code, documentation and issue tracking, for co-development, are at https://github.com/dfci/[email protected], reaches all authors

Residue contacts predicted by evolutionary covariance extend the application of ab initio molecular replacement to larger and more challenging protein folds

IUCrJ ◽

10.1107/s2052252516008113 ◽

2016 ◽

Vol 3 (4) ◽

pp. 259-270 ◽

Cited By ~ 11

Author(s):

Felix Simkovic ◽

Jens M. H. Thomas ◽

Ronan M. Keegan ◽

Martyn D. Winn ◽

Olga Mayans ◽

...

Keyword(s):

Ab Initio ◽

Structure Prediction ◽

Sequence Information ◽

Protein Targets ◽

Residue Contact ◽

Residue Contacts ◽

Structure Solution ◽

Improved Performance ◽

Model Ensembles ◽

Contact Predictions

For many protein families, the deluge of new sequence information together with new statistical protocols now allow the accurate prediction of contacting residues from sequence information alone. This offers the possibility of more accurate ab initio (non-homology-based) structure prediction. Such models can be used in structure solution by molecular replacement (MR) where the target fold is novel or is only distantly related to known structures. Here, AMPLE, an MR pipeline that assembles search-model ensembles from ab initio structure predictions ('decoys'), is employed to assess the value of contact-assisted ab initio models to the crystallographer. It is demonstrated that evolutionary covariance-derived residue–residue contact predictions improve the quality of ab initio models and, consequently, the success rate of MR using search models derived from them. For targets containing β-structure, decoy quality and MR performance were further improved by the use of a β-strand contact-filtering protocol. Such contact-guided decoys achieved 14 structure solutions from 21 attempted protein targets, compared with nine for simple Rosetta decoys. Previously encountered limitations were superseded in two key respects. Firstly, much larger targets of up to 221 residues in length were solved, which is far larger than the previously benchmarked threshold of 120 residues. Secondly, contact-guided decoys significantly improved success with β-sheet-rich proteins. Overall, the improved performance of contact-guided decoys suggests that MR is now applicable to a significantly wider range of protein targets than were previously tractable, and points to a direct benefit to structural biology from the recent remarkable advances in sequencing.

De Novo Prediction of Human Chromosome Structures: Epigenetic Marking Patterns Encode Genome Architecture

10.1101/173088 ◽

2017 ◽

Cited By ~ 4

Author(s):

Michele Di Pierro ◽

Ryan R. Cheng ◽

Erez Lieberman Aiden ◽

Peter G. Wolynes ◽

José N. Onuchic

Keyword(s):

Neural Network ◽

Phase Separation ◽

Structure Prediction ◽

De Novo ◽

Three Dimensional ◽

Sufficient Information ◽

Proximity Ligation ◽

Genomic Annotation ◽

Structural Ensembles ◽

Chromatin Structural

Inside the cell nucleus, genomes fold into organized structures that are characteristic of cell type. Here, we show that this chromatin architecture can be predicted de novo using epigenetic data derived from ChIP-Seq. We exploit the idea that chromosomes encode a one-dimensional sequence of chromatin structural types. Interactions between these chromatin types determine the three-dimensional (3D) structural ensemble of chromosomes through a process similar to phase separation. First, a recurrent neural network is used to infer the relation between the epigenetic marks present at a locus, as assayed by ChIP-Seq, and the genomic compartment in which those loci reside, as measured by DNA-DNA proximity ligation (Hi-C). Next, types inferred from this neural network are used as an input to an energy landscape model for chromatin organization (MiChroM) in order to generate an ensemble of 3D chromosome conformations. After training the model, dubbed MEGABASE (Maximum Entropy Genomic Annotation from Biomarkers Associated to Structural Ensembles), on odd numbered chromosomes, we predict the chromatin type sequences and the subsequent 3D conformational ensembles for the even chromosomes. We validate these structural ensembles by using ChIP-Seq tracks alone to predict Hi-C maps as well as distances measured using 3D FISH experiments. Both sets of experiments support the hypothesis of phase separation being the driving process behind compartmentalization. These findings strongly suggest that epigenetic marking patterns encode sufficient information to determine the global architecture of chromosomes and that de novo structure prediction for whole genomes may be increasingly possible.

Contrastive learning on protein embeddings enlightens midnight zone at lightning speed

10.1101/2021.11.14.468528 ◽

2021 ◽

Author(s):

Michael Heinzinger ◽

Maria Littmann ◽

Ian Sillitoe ◽

Nicola Bordin ◽

Christine Orengo ◽

...

Keyword(s):

Structure Prediction ◽

Sequence Similarity ◽

3D Structure ◽

Three Dimensional ◽

Hierarchical Classification ◽

Language Models ◽

Sequence Alignments ◽

Sequence Comparisons ◽

Multiple Sequence ◽

3D Structures

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI) transferring annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so-called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin to random relations and is therefore hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments, allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT .
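The embedding-based annotation transfer (EAT) described above can be pictured as a nearest-neighbour lookup: a query embedding inherits the annotation of the closest labelled embedding. The Python sketch below illustrates only that idea. The Euclidean distance, the three-dimensional toy vectors and the class labels are assumptions for this example; it does not reproduce the code in the Rostlab/EAT repository or the contrastive training step.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup entry closest to `query`.

    `lookup` is a list of (label, embedding) pairs. The distance metric
    is an assumption; the abstract does not specify one.
    """
    best_label, best_vec = min(
        lookup, key=lambda item: euclidean(query, item[1])
    )
    return best_label, euclidean(query, best_vec)


# Hypothetical labelled embeddings; real pLM embeddings have hundreds of dimensions or more.
lookup = [
    ("class_A", [0.0, 0.0, 1.0]),
    ("class_B", [1.0, 0.0, 0.0]),
    ("class_C", [0.0, 1.0, 0.0]),
]
label, dist = transfer_annotation([0.9, 0.1, 0.0], lookup)
print(label, round(dist, 3))
```

Because the lookup needs only vector distances and no alignment, it scales as a plain search over stored embeddings, which is consistent with the speed advantage the abstract reports.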

Distillation of MSA Embeddings to Folded Protein Structures with Graph Transformers

10.1101/2021.06.02.446809 ◽

2021 ◽

Author(s):

Allan Costa ◽

Manvitha Ponnapati ◽

Joseph M Jacobson ◽

Pranam Chatterjee

Keyword(s):

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Three Dimensional ◽

Protein Sequences ◽

Language Models ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Folded Structures

Determining the structure of proteins has been a long-standing goal in biology. Language models have been recently deployed to capture the evolutionary semantics of protein sequences. Enriched with multiple sequence alignments (MSA), these models can encode protein tertiary structure. In this work, we introduce an attention-based graph architecture that exploits MSA Transformer embeddings to directly produce three-dimensional folded structures from protein sequences. We envision that this pipeline will provide a basis for efficient, end-to-end protein structure prediction.

Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

Bioinformatics ◽

10.1093/bioinformatics/btv592 ◽

2015 ◽

Vol 32 (6) ◽

pp. 814-820 ◽

Cited By ~ 14

Author(s):

Gearóid Fox ◽

Fabian Sievers ◽

Desmond G. Higgins

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

De Novo ◽

Biological Data ◽

Supplementary Information ◽

Test Case ◽

Sequence Alignments ◽

Progressive Alignment ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Faculty Opinions recommendation of De novo prediction of three-dimensional structures for major protein families.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1009270.149357 ◽

2002 ◽

Author(s):

Janet Thornton

Keyword(s):

De Novo ◽

Three Dimensional ◽

Major Protein ◽

Protein Families

Generating, Maintaining, and Exploiting Diversity in a Memetic Algorithm for Protein Structure Prediction

Evolutionary Computation ◽

10.1162/evco_a_00176 ◽

2016 ◽

Vol 24 (4) ◽

pp. 577-607 ◽

Cited By ~ 23

Author(s):

Mario Garza-Fabre ◽

Shaun M. Kandathil ◽

Julia Handl ◽

Joshua Knowles ◽

Simon C. Lovell

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Tertiary Structure ◽

De Novo ◽

Memetic Algorithm ◽

Scale Up ◽

Three Dimensional ◽

Limiting Factors ◽

Genetic Operators

Computational approaches to de novo protein tertiary structure prediction, including those based on the preeminent “fragment-assembly” technique, have failed to scale up fully to larger proteins (on the order of 100 residues and above). A number of limiting factors are thought to contribute to the scaling problem over and above the simple combinatorial explosion, but the key ones relate to the lack of exploration of properly diverse protein folds, and to an acute form of “deception” in the energy function, whereby low-energy conformations do not reliably equate with native structures. In this article, solutions to both of these problems are investigated through a multistage memetic algorithm incorporating the successful Rosetta method as a local search routine. We found that specialised genetic operators significantly add to structural diversity and that this translates well to reaching low energies. The use of a generalised stochastic ranking procedure for selection enables the memetic algorithm to handle and traverse deep energy wells that can be considered deceptive, which further adds to the ability of the algorithm to obtain a much-improved diversity of folds. The results should translate to a tangible improvement in the performance of protein structure prediction algorithms in blind experiments such as CASP, and potentially to a further step towards the more challenging problem of predicting the three-dimensional shape of large proteins.