OPUS-X: An Open-Source Toolkit for Protein Torsion Angles, Secondary Structure, Solvent Accessibility, Contact Map Predictions, and 3D Folding

Bioinformatics ◽

10.1093/bioinformatics/btab633 ◽

2021 ◽

Author(s):

Gang Xu ◽

Qinghua Wang ◽

Jianpeng Ma

Keyword(s):

Secondary Structure ◽

Open Source ◽

Solvent Accessibility ◽

3D Structure ◽

Previous Method ◽

Supplementary Information ◽

Torsion Angles ◽

Structure Information ◽

Evolutionary Features ◽

Gradient Based

Abstract Motivation The development of an open-source platform to predict protein 1D features and 3D structure is an important task. In this paper, we report an open-source toolkit for protein 3D structure modeling, named OPUS-X. It contains three modules: OPUS-TASS2, which predicts protein torsion angles, secondary structure and solvent accessibility; OPUS-Contact, which measures the distance and orientation information between different residue pairs; and OPUS-Fold2, which uses the constraints derived from the first two modules to guide folding. Results OPUS-TASS2 is an upgraded version of our previous method OPUS-TASS. OPUS-TASS2 integrates protein global structure information and significantly outperforms OPUS-TASS. OPUS-Contact combines multiple raw co-evolutionary features with protein 1D features predicted by OPUS-TASS2, and delivers better results than the open-source state-of-the-art method trRosetta. OPUS-Fold2 is a complementary version of our previous method OPUS-Fold. OPUS-Fold2 is a gradient-based protein folding framework built on differentiable energy terms, as opposed to OPUS-Fold, which is a sampling-based method designed to handle non-differentiable terms. OPUS-Fold2 exhibits comparable performance to the Rosetta folding protocol in trRosetta when using identical inputs. OPUS-Fold2 is written in Python and TensorFlow 2.4, which makes source-code-level modification straightforward. Availability The code and pre-trained models of OPUS-X can be downloaded from https://github.com/OPUS-MaLab/opus_x. Supplementary information Supplementary data are available at Bioinformatics online.
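
The gradient-based idea in the abstract above can be illustrated with a toy Python sketch. This is not the OPUS-Fold2 implementation: it assumes a single harmonic distance-restraint energy, derives the gradient by hand rather than through TensorFlow, and uses hypothetical names (`energy_and_grad`, `fold`) and illustrative values for the restraints, step count and learning rate.

```python
import math
import random


def energy_and_grad(coords, restraints):
    """Harmonic distance-restraint energy and its analytic gradient.

    coords: list of [x, y, z] points.
    restraints: dict mapping (i, j) index pairs to target distances.
    """
    energy = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for (i, j), target in restraints.items():
        diff = [a - b for a, b in zip(coords[i], coords[j])]
        dist = math.sqrt(sum(d * d for d in diff)) + 1e-9  # avoid division by zero
        err = dist - target
        energy += err * err
        for k in range(3):
            g = 2.0 * err * diff[k] / dist  # d(err^2)/d(coords[i][k])
            grad[i][k] += g
            grad[j][k] -= g
    return energy, grad


def fold(restraints, n_points, steps=2000, lr=0.01, seed=0):
    """Plain gradient descent from a random start (toy settings, assumed)."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_points)]
    for _ in range(steps):
        _, grad = energy_and_grad(coords, restraints)
        for point, g in zip(coords, grad):
            for k in range(3):
                point[k] -= lr * g[k]
    return coords


if __name__ == "__main__":
    # Three points with mutually consistent target distances (illustrative values).
    toy_restraints = {(0, 1): 3.8, (1, 2): 3.8, (0, 2): 5.0}
    folded = fold(toy_restraints, n_points=3)
    print(energy_and_grad(folded, toy_restraints)[0])
```

A sampling-based method would instead propose random moves and accept or reject them, which is why it can cope with energy terms that have no usable gradient.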

OPUS-X: An Open-Source Toolkit for Protein Torsion Angles, Secondary Structure, Solvent Accessibility, Contact Map Predictions, and 3D Folding

10.1101/2021.05.08.443219 ◽

2021 ◽

Author(s):

Gang Xu ◽

Qinghua Wang ◽

Jianpeng Ma

Keyword(s):

Secondary Structure ◽

Open Source ◽

Solvent Accessibility ◽

3D Structure ◽

Previous Method ◽

Torsion Angles ◽

Structure Information ◽

Evolutionary Features ◽

Comparable Performance ◽

Gradient Based

In this paper, we report an open-source toolkit for protein 3D structure modeling, named OPUS-X. It contains three modules: OPUS-TASS2, which predicts protein torsion angles, secondary structure and solvent accessibility; OPUS-Contact, which measures the distance and orientation information between different residue pairs; and OPUS-Fold2, which uses the constraints derived from the first two modules to guide folding. OPUS-TASS2 is an upgraded version of our previous method OPUS-TASS (Bioinformatics 2020, 36 (20), 5021-5026). OPUS-TASS2 integrates protein global structure information and significantly outperforms OPUS-TASS. OPUS-Contact combines multiple raw co-evolutionary features with protein 1D features predicted by OPUS-TASS2, and delivers better results than the open-source state-of-the-art method trRosetta. OPUS-Fold2 is a complementary version of our previous method OPUS-Fold (J. Chem. Theory Comput. 2020, 16 (6), 3970-3976). OPUS-Fold2 is a gradient-based protein folding framework built on differentiable energy terms, as opposed to OPUS-Fold, which is a sampling-based method designed to handle non-differentiable terms. OPUS-Fold2 exhibits comparable performance to the Rosetta folding protocol in trRosetta when using identical inputs. OPUS-Fold2 is written in Python and TensorFlow 2.4, which makes source-code-level modification straightforward. The code and pre-trained models of OPUS-X can be downloaded from https://github.com/OPUS-MaLab/opus_x.
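
For readers unfamiliar with residue-pair distance information, the following Python sketch shows how a pairwise distance matrix and a binary contact map relate to 3D coordinates. It is an illustration only and does not reproduce OPUS-Contact, which predicts such quantities from sequence-derived features. The 8 Å threshold, the minimum sequence separation and the function names are assumptions made for this example.

```python
import math


def distance_matrix(coords):
    """Euclidean distance between every pair of points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, threshold=8.0, min_separation=3):
    """1 where two residues are closer than `threshold` (assumed 8 Å) and at
    least `min_separation` positions apart in sequence, else 0."""
    dists = distance_matrix(coords)
    n = len(coords)
    return [
        [
            1 if abs(i - j) >= min_separation and dists[i][j] < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # A made-up hairpin-like trace with 3.8 Å spacing between neighbours.
    trace = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (7.6, 3.8, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
    for row in contact_map(trace):
        print(row)
```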

Developing structural profile matrices for protein secondary structure and solvent accessibility prediction

Bioinformatics ◽

10.1093/bioinformatics/btz238 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4004-4010 ◽

Cited By ~ 3

Author(s):

Zafer Aydin ◽

Nuh Azginoglu ◽

Halil Ibrahim Bilgin ◽

Mete Celik

Keyword(s):

Secondary Structure ◽

Structure Prediction ◽

Solvent Accessibility ◽

3D Structure ◽

Protein Secondary Structure ◽

Amino Acid Position ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Structural Profile ◽

Benchmark Datasets

Abstract Motivation Predicting the secondary structure and solvent accessibility of proteins is among the essential steps that precede more elaborate 3D structure prediction tasks. Incorporating class label information contained in templates with known structures has the potential to improve the accuracy of prediction methods. Building a structural profile matrix is one such technique that provides a distribution for class labels at each amino acid position of the target. Results In this paper, a new structural profiling technique is proposed that is based on deriving PFAM families and is combined with an existing approach. Cross-validation experiments on two benchmark datasets and at various similarity intervals demonstrate that the proposed profiling strategy performs significantly better than Homolpro, a state-of-the-art method for incorporating template information, as assessed by statistical hypothesis tests. Availability and implementation The DSPRED method can be accessed by visiting the PSP server at http://psp.agu.edu.tr. Source code and binaries are freely available at https://github.com/yusufzaferaydin/dspred. Supplementary information Supplementary data are available at Bioinformatics online.
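
To make the notion of a structural profile matrix concrete, here is a minimal Python sketch that turns aligned template labels into a per-position distribution over classes. It is not the DSPRED method. The three-state alphabet (H, E, C), the use of `-` for gaps, the optional template weights and the function name `structural_profile` are all assumptions for illustration.

```python
def structural_profile(aligned_labels, weights=None, classes="HEC"):
    """Per-position class-label distribution from aligned template labels.

    aligned_labels: equal-length strings over `classes`, with '-' for gaps.
    weights: optional per-template weights (e.g. by similarity); default 1.0.
    Returns a list of dicts, one per target position, each summing to 1.
    """
    if weights is None:
        weights = [1.0] * len(aligned_labels)
    length = len(aligned_labels[0])
    profile = []
    for pos in range(length):
        counts = {c: 0.0 for c in classes}
        for labels, weight in zip(aligned_labels, weights):
            label = labels[pos]
            if label in counts:  # gaps and unknown symbols are skipped
                counts[label] += weight
        total = sum(counts.values())
        if total == 0.0:
            # No template covers this position: fall back to a uniform row.
            row = {c: 1.0 / len(classes) for c in classes}
        else:
            row = {c: counts[c] / total for c in classes}
        profile.append(row)
    return profile


if __name__ == "__main__":
    templates = ["HHE--", "HEEC-", "HHC--"]
    for position, row in enumerate(structural_profile(templates)):
        print(position, row)
```

Each row could then be appended to the sequence-derived features of the corresponding target position before prediction.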

OPUS-TASS: a protein backbone torsion angles and secondary structure predictor based on ensemble neural networks

Bioinformatics ◽

10.1093/bioinformatics/btaa629 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5021-5026 ◽

Cited By ~ 3

Author(s):

Gang Xu ◽

Qinghua Wang ◽

Jianpeng Ma

Keyword(s):

Neural Networks ◽

Secondary Structure ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Supplementary Information ◽

Learning Approaches ◽

Protein Backbone ◽

Torsion Angles ◽

Secondary Structure Predictions ◽

The Mean

Abstract Motivation Predictions of protein backbone torsion angles (ϕ and ψ) and secondary structure from sequence are crucial subproblems in protein structure prediction. With the development of deep learning approaches, their accuracies have been significantly improved. To capture long-range interactions, most studies integrate bidirectional recurrent neural networks into their models. In this study, we introduce and modify a recently proposed architecture named Transformer to capture interactions between any two residues, in theory at arbitrary distance. Moreover, we take advantage of multitask learning to improve the generalization of the neural network by introducing related tasks into the training process. Similar to many previous studies, OPUS-TASS uses an ensemble of models and achieves better results. Results OPUS-TASS uses the same training and validation sets as SPOT-1D. We compare the performance of OPUS-TASS and SPOT-1D on TEST2016 (1213 proteins) and TEST2018 (250 proteins) proposed in the SPOT-1D paper, CASP12 (55 proteins), CASP13 (32 proteins) and CASP-FM (56 proteins) proposed in the SAINT paper, and a recently released PDB structure collection from CAMEO (93 proteins) named CAMEO93. On these six test sets, OPUS-TASS achieves consistent improvements in both backbone torsion angle prediction and secondary structure prediction. On CAMEO93, SPOT-1D achieves mean absolute errors of 16.89 and 23.02 for ϕ and ψ predictions, respectively, and accuracies of 87.72% and 77.15% for 3- and 8-state secondary structure predictions, respectively. In comparison, OPUS-TASS achieves 16.56 and 22.56 for ϕ and ψ predictions, and 89.06% and 78.87% for 3- and 8-state secondary structure predictions, respectively. In particular, after using our torsion angle refinement method OPUS-Refine as a post-processing procedure for OPUS-TASS, the mean absolute errors for the final ϕ and ψ predictions are further decreased to 16.28 and 21.98, respectively.
Availability and implementation The training and the inference codes of OPUS-TASS and its data are available at https://github.com/thuxugang/opus_tass. Supplementary information Supplementary data are available at Bioinformatics online.
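
The abstract above reports mean absolute errors for ϕ and ψ without describing the computation. Because torsion angles are periodic, a common convention is to take the shorter way around the circle for each residue. The Python sketch below assumes that convention and angles in degrees; the function name `angular_mae` is hypothetical and the paper's exact evaluation code may differ.

```python
def angular_mae(predicted, observed):
    """Mean absolute error between two lists of angles in degrees,
    using the shorter arc so that 179 and -179 differ by 2, not 358."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = abs(p - o) % 360.0
        total += min(diff, 360.0 - diff)
    return total / len(predicted)


if __name__ == "__main__":
    # Made-up angles, for illustration only.
    print(angular_mae([179.0, -60.0], [-179.0, -50.0]))
```

Without the wrap-around step, a prediction of 179° against an observed -179° would count as an error of 358° and distort the average.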

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Bioinformatics ◽

10.1093/bioinformatics/btaa701 ◽

2020 ◽

Cited By ~ 1

Author(s):

Amelia Villegas-Morcillo ◽

Stavros Makrodimitris ◽

Roeland C H J van Ham ◽

Angel M Gomez ◽

Victoria Sanchez ◽

...

Keyword(s):

Protein Function ◽

Prediction Models ◽

Protein Function Prediction ◽

3D Structure ◽

Function Prediction ◽

Feature Representation ◽

Training Data ◽

Supplementary Information ◽

Molecular Function ◽

Structure Information

Abstract Motivation Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available. Results We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information Supplementary data are available at Bioinformatics online.
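
Two of the hand-crafted baselines named in the abstract above, one-hot encoding and k-mer counts, are simple enough to sketch in Python. The code below is a generic illustration and is not taken from the paper's repository. The 20-letter alphabet ordering, the handling of non-standard residues and the function names are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot(sequence):
    """One 20-dimensional indicator row per residue; unknown residues give all zeros."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Fixed-length vector of overlapping k-mer counts (20**k entries)."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get("".join(kmer), 0) for kmer in product(AMINO_ACIDS, repeat=k)]


if __name__ == "__main__":
    print(one_hot("AC")[0][:5])
    print(sum(kmer_counts("ACAC", k=2)))
```

One-hot rows grow with sequence length, whereas the k-mer vector has a fixed size, which is one reason k-mer counts are convenient as input to a simple classifier.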

Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information

BMC Bioinformatics ◽

10.1186/1471-2105-8-201 ◽

2007 ◽

Vol 8 (1) ◽

Cited By ~ 74

Author(s):

Gianluca Pollastri ◽

Alberto JM Martin ◽

Catherine Mooney ◽

Alessandro Vullo

Keyword(s):

Secondary Structure ◽

Solvent Accessibility ◽

Protein Secondary Structure ◽

Accurate Prediction ◽

Structure Information

Ancestral sequence reconstruction: accounting for structural information by averaging over replacement matrices

Bioinformatics ◽

10.1093/bioinformatics/bty1031 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2562-2568

Author(s):

Asher Moshe ◽

Tal Pupko

Keyword(s):

Structural Information ◽

Solvent Accessibility ◽

3D Structure ◽

Three Dimensional ◽

Ancestral Sequence ◽

Supplementary Information ◽

Ancestral Sequence Reconstruction ◽

Ancestral Sequences ◽

Sequence Reconstruction ◽

And Function

Abstract Motivation Ancestral sequence reconstruction (ASR) is widely used to understand protein evolution, structure and function. Current ASR methodologies do not fully consider differences in evolutionary constraints among positions imposed by the three-dimensional (3D) structure of the protein. Here, we developed an ASR algorithm that allows different protein sites to evolve according to different mixtures of replacement matrices. We show that assigning replacement matrices to protein positions based on their solvent accessibility leads to ASR with higher log-likelihoods compared to naïve models that assume a single replacement matrix for all sites. Improved ASR log-likelihoods are also demonstrated when solvent accessibility is predicted from protein sequences rather than inferred from a known 3D structure. Finally, we show that using such structure-aware mixture models results in substantial differences in the inferred ancestral sequences. Availability and implementation http://fastml.tau.ac.il. Supplementary information Supplementary data are available at Bioinformatics online.

RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures

Bioinformatics ◽

10.1093/bioinformatics/btaa944 ◽

2020 ◽

Author(s):

Louis Becquey ◽

Eric Angel ◽

Fariza Tahi

Keyword(s):

Machine Learning ◽

Secondary Structure ◽

Rna Structure ◽

Structure Prediction ◽

Fundamental Problem ◽

3D Structure ◽

Data Gathering ◽

Supplementary Information ◽

Rna Sequences ◽

Scoring Matrices

Abstract Motivation Applied research in machine learning progresses faster when a clean dataset is available and ready to use. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, so that time is not spent on data gathering and cleaning. Results Here, we propose a first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived from annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided. Availability and implementation The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification. Supplementary information Supplementary data are available at Bioinformatics online.

ccNetViz: a WebGL-based JavaScript library for visualization of large networks

Bioinformatics ◽

10.1093/bioinformatics/btaa559 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4527-4529

Author(s):

Ales Saska ◽

David Tichy ◽

Robert Moore ◽

Achilles Rasquinha ◽

Caner Akdas ◽

...

Keyword(s):

Systems Biology ◽

Complex Networks ◽

Open Source ◽

High Speed ◽

A Priori ◽

Supplementary Information ◽

Network Visualization ◽

Supplementary Data ◽

Web Based ◽

Flow Of Information

Abstract Summary Visualizing a network provides a concise and practical understanding of the information it represents. Open-source web-based libraries help accelerate the creation of biologically based networks and their use. ccNetViz is an open-source, high-speed and lightweight JavaScript library for visualization of large and complex networks. It implements customization and analytical features for easy network interpretation. These features include node statistics as well as edge and node animations, which illustrate the flow of information through a network. Properties can be defined a priori or dynamically imported from models and simulations. ccNetViz is thus a network visualization library particularly suited for systems biology. Availability and implementation The ccNetViz library, demos and documentation are freely available at http://helikarlab.github.io/ccNetViz/. Supplementary information Supplementary data are available at Bioinformatics online.

Assessing the Impact of Secondary Structure and Solvent Accessibility on Protein Evolution

Genetics ◽

10.1093/genetics/149.1.445 ◽

1998 ◽

Vol 149 (1) ◽

pp. 445-458 ◽

Cited By ~ 21

Author(s):

Nick Goldman ◽

Jeffrey L Thorne ◽

David T Jones

Keyword(s):

Amino Acid ◽

Secondary Structure ◽

Protein Evolution ◽

Solvent Accessibility ◽

Strong Association ◽

Length Distribution ◽

Parametric Bootstrap ◽

Amino Acid Replacement ◽

Physical Constraints ◽

The Impact

Abstract Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution—both above and below the species level—will not be well understood until the physical constraints that affect protein evolution are identified and characterized.

CATH functional families predict functional sites in proteins

Bioinformatics ◽

10.1093/bioinformatics/btaa937 ◽

2020 ◽

Author(s):

Sayoni Das ◽

Harry M Scholes ◽

Neeladri Sen ◽

Christine Orengo

Keyword(s):

Functional Characterization ◽

Functional Site ◽

Training Data ◽

Supplementary Information ◽

Conserved Residues ◽

Functional Sites ◽

Protein Protein Interaction ◽

Evolutionary Features ◽

Functional Families

Abstract Motivation Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein–protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). Results FunSite’s prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite’s performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyze which structural and evolutionary features are most predictive for functional sites. Availability and implementation https://github.com/UCL/cath-funsite-predictor. Supplementary information Supplementary data are available at Bioinformatics online.
