Accuracy of protein-level disorder predictions

2019 ◽  
Vol 21 (5) ◽  
pp. 1509-1522 ◽  
Author(s):  
Akila Katuwawala ◽  
Christopher J Oldfield ◽  
Lukasz Kurgan

Abstract Experimental annotations of intrinsic disorder are available for 0.1% of the 147 000 000 currently sequenced proteins. Over 60 sequence-based disorder predictors were developed to help bridge this gap. Current benchmarks of these methods assess predictive performance on datasets of proteins; however, predictions are often interpreted for individual proteins. We demonstrate that the protein-level predictive performance varies substantially from the dataset-level benchmarks. Thus, we perform a first-of-its-kind protein-level assessment for 13 popular disorder predictors using 6200 disorder-annotated proteins. We show that the protein-level distributions are substantially skewed toward high predictive quality while having long tails of poor predictions. Consequently, between 57% and 75% of proteins secure higher predictive performance than the currently used dataset-level assessment suggests, but as many as 30% of proteins that are located in the long tails suffer low predictive performance. These proteins typically have relatively high amounts of disorder, in contrast to the mostly structured proteins that are predicted accurately by all 13 methods. Interestingly, each predictor provides the most accurate results for some number of proteins, while the method that performs best at the dataset level is in fact the best for only about 30% of proteins. Moreover, the majority of proteins are predicted more accurately than the dataset-level performance of the most accurate tool by at least four disorder predictors. While these results suggest that disorder predictors outperform their current benchmark performance for the majority of proteins and that they complement each other, novel tools that accurately identify the hard-to-predict proteins and that make accurate predictions for these proteins are needed.
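
The difference between dataset-level and protein-level assessment described in this abstract can be illustrated with a minimal sketch. It assumes binary per-residue annotations and predictions and uses plain accuracy as the metric; the abstract does not state which metrics the authors used, and the toy data and all function names here are hypothetical.

```python
def accuracy(labels, preds):
    """Fraction of residues whose binary prediction matches the annotation."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def dataset_level_accuracy(proteins):
    """Pool the residues of all proteins and compute one score."""
    all_labels = [l for labels, _ in proteins for l in labels]
    all_preds = [p for _, preds in proteins for p in preds]
    return accuracy(all_labels, all_preds)


def protein_level_accuracies(proteins):
    """Compute one score per protein."""
    return [accuracy(labels, preds) for labels, preds in proteins]


# Toy data (hypothetical): (annotations, predictions), 1 = disordered residue.
proteins = [
    ([0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]),  # structured
    ([0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1]),  # mostly structured
    ([1, 1, 1, 1, 1, 1, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0]),  # mostly disordered
]

pooled = dataset_level_accuracy(proteins)
per_protein = protein_level_accuracies(proteins)
# Share of proteins that score above the single pooled number.
fraction_above = sum(a > pooled for a in per_protein) / len(per_protein)

print(f"dataset-level accuracy: {pooled:.3f}")
print(f"protein-level accuracies: {per_protein}")
print(f"fraction of proteins above the dataset-level score: {fraction_above:.2f}")
```

In this toy case two of three proteins score above the pooled value while the third, highly disordered protein falls well below it, which mirrors the skewed distribution with a long tail that the abstract reports.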

2018 ◽  
Vol 35 (10) ◽  
pp. 1692-1700 ◽  
Author(s):  
Gang Hu ◽  
Zhonghua Wu ◽  
Christopher J Oldfield ◽  
Chen Wang ◽  
Lukasz Kurgan

Abstract Motivation: While putative intrinsic disorder is widely used, none of the predictors provides quality assessment (QA) scores. QA scores estimate the likelihood that predictions are correct at a residue level and have been applied in other bioinformatics areas. We recently reported that QA scores derived from putative disorder propensities perform relatively poorly for native disordered residues. Here we design and validate a general approach to construct QA predictors for disorder predictions. Results: The QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions) toolbox of methods accommodates a diverse set of ten disorder predictors. It builds upon several innovative design elements including use and scaling of selected physicochemical properties of the input sequence, post-processing of disorder propensity scores, and a feature selection that optimizes the predictive models to a specific disorder predictor. We empirically establish that each one of these elements contributes to the overall predictive performance of our tool and that QUARTER’s outputs significantly outperform QA scores derived from the outputs generated by the disorder predictors. The best performing QA scores for a single disorder predictor identify 13% of residues that are predicted with 98% precision. QA scores computed by combining results of the ten disorder predictors cover 40% of residues with 95% precision. Case studies are used to show how to interpret the QA scores. QA scores based on the high precision combined predictions are applied to analyze disorder in the human proteome. Availability and implementation: http://biomine.cs.vcu.edu/servers/QUARTER/ Supplementary information: Supplementary data are available at Bioinformatics online.
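
Statements such as "13% of residues that are predicted with 98% precision" pair a coverage with a precision. A minimal sketch of how such a pair could be computed from QA scores follows. It assumes that residues are ranked by QA score and that a prefix of the ranking is selected; this is an illustrative reading, not QUARTER's documented procedure, and the function name and toy data are hypothetical.

```python
def coverage_at_precision(qa_scores, is_correct, target_precision):
    """Largest fraction of residues, taken in order of decreasing QA score,
    for which the share of correct disorder predictions is >= target_precision.

    Ties between QA scores are broken arbitrarily in this sketch.
    """
    ranked = sorted(zip(qa_scores, is_correct), key=lambda pair: pair[0], reverse=True)
    best_coverage = 0.0
    n_correct = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        n_correct += ok
        if n_correct / k >= target_precision:
            best_coverage = k / len(ranked)
    return best_coverage


# Toy data (hypothetical): one QA score per residue and whether the
# underlying disorder prediction for that residue was correct.
qa_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_correct = [True, True, True, False, True, True, False, True, False, False]

print(coverage_at_precision(qa_scores, is_correct, 1.0))
print(coverage_at_precision(qa_scores, is_correct, 0.8))
```

Lowering the required precision can only keep or increase the coverage, which matches the trade-off between the two figures quoted in the abstract.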


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Gang Hu ◽  
Akila Katuwawala ◽  
Kui Wang ◽  
Zhonghua Wu ◽  
Sina Ghadermarzi ◽  
...  

Abstract Identification of intrinsic disorder in proteins relies in large part on computational predictors, which demands that their accuracy should be high. Since intrinsic disorder carries out a broad range of cellular functions, it is desirable to couple the disorder and disorder function predictions. We report a computational tool, flDPnn, that provides accurate, fast and comprehensive disorder and disorder function predictions from protein sequences. The recent Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment and results on other test datasets demonstrate that flDPnn offers accurate predictions of disorder, fully disordered proteins and four common disorder functions. These predictions are substantially better than the results of the existing disorder predictors and methods that predict functions of disorder. Ablation tests reveal that the high predictive performance stems from innovative ways used in flDPnn to derive sequence profiles and encode inputs. flDPnn’s webserver is available at http://biomine.cs.vcu.edu/servers/flDPnn/


Fire ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 71
Author(s):  
Cory W. Ott ◽  
Bishrant Adhikari ◽  
Simon P. Alexander ◽  
Paddington Hodza ◽  
Chen Xu ◽  
...  

The scope of wildfires over the previous decade has brought these natural hazards to the forefront of risk management. Wildfires threaten human health, safety, and property, and there is a need for comprehensive and readily usable wildfire simulation platforms that can be applied effectively by wildfire experts to help preserve physical infrastructure, biodiversity, and landscape integrity. Evaluating such platforms is important, particularly in determining the platforms’ reliability in forecasting the spatiotemporal trajectories of wildfire events. This study evaluated the predictive performance of a wildfire simulation platform that implements a Monte Carlo-based wildfire model called WyoFire. WyoFire was used to predict the growth of 10 wildfires that occurred in Wyoming, USA, in 2017 and 2019. The predictive quality of this model was determined by comparing disagreement and agreement areas between the observed and simulated wildfire boundaries. Overestimation–underestimation was greatest in grassland fires (>32) and lowest in mixed-forest, woodland, and shrub-steppe fires (<−2.5). Spatial and statistical analyses of observed and predicted fire perimeters were conducted to measure the accuracy of the predicted outputs. The results indicate that simulations of wildfires that occurred in shrubland- and grassland-dominated environments tended to over-predict, while simulations of fires that took place within forested and woodland-dominated environments tended to under-predict.
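
The comparison of agreement and disagreement areas can be sketched as set operations on burned grid cells. This is only an illustration under the assumption that both perimeters have been rasterized onto the same grid; the abstract does not define the overestimation–underestimation index, so none is computed here, and the function name and toy footprints are hypothetical.

```python
def compare_perimeters(observed, simulated):
    """Count agreement and disagreement cells between two burned-area footprints.

    Each footprint is a set of (x, y) grid cells on a shared grid.
    """
    agreement = observed & simulated        # burned in both
    overestimated = simulated - observed    # simulated as burned, not observed
    underestimated = observed - simulated   # observed as burned, not simulated
    return len(agreement), len(overestimated), len(underestimated)


# Toy footprints (hypothetical): a 4x4 observed fire and a simulated fire
# that is shifted one cell to the right and extends further.
observed = {(x, y) for x in range(4) for y in range(4)}
simulated = {(x, y) for x in range(1, 6) for y in range(4)}

agree, over, under = compare_perimeters(observed, simulated)
print(f"agreement: {agree} cells, over: {over} cells, under: {under} cells")
```

Here the simulated fire over-predicts more cells than it misses, the pattern the abstract reports for grassland-dominated fires.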


2018 ◽  
Author(s):  
Da Chen Emily Koo ◽  
Richard Bonneau

Abstract Motivation: Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. Results: We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction by testing on both human and yeast proteomes. We compare region-level predictive performance of our method against that of a whole-protein baseline method using a held-out dataset of proteins with structurally-verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into site-specific and whole-protein terms and select prediction methods for different classes of GO terms. Availability: The code is freely available at: https://github.com/ek1203/region_spec_func_pred


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 178-178
Author(s):  
Arthur Francisco Araujo Fernandes ◽  
João R R Dorea ◽  
Bruno D Valente ◽  
Robert Fitzgerald ◽  
William O Herring ◽  
...  

Abstract The measurement of carcass traits in live pigs, such as muscle depth (MD) and backfat thickness (BF), is a topic of great interest for breeding companies and production farms. Breeding companies currently measure MD and BF using medical imaging technologies such as ultrasound (US). However, US is costly, requires trained personnel, and involves direct interaction with the animals, which is an added stressor. An interesting alternative in this regard is to use computer vision techniques. Farmers would also benefit from such an application as they would be able to better adjust feed composition and delivery. Therefore, the objectives of this study were: (1) to develop a computer vision system for prediction of MD and BF from 3D images of finishing pigs; (2) to compare the predictive ability of statistical (multiple linear regression, partial least squares) and machine learning (elastic networks and artificial neural networks) approaches using features extracted from the images against a deep learning (DL) approach that uses the raw image as input. A dataset containing 3D images and ultrasound measurements of 618 pigs with average body weight of 120 kg, MD of 65 mm, and BF of 6 mm was used in this study. To assess the predictive performance of the different strategies, a 5-fold cross-validation approach was used. The DL achieved the best predictive performance for both traits, with predictive mean absolute scaled error (MASE) of 5.10% and 13.62%, root-mean-square error (RMSE) of 4.35 mm and 1.10 mm, and R² of 0.51 and 0.45, for MD and BF respectively. In conclusion, it was demonstrated that it is possible to satisfactorily predict MD and BF using 3D images that were autonomously collected in farm conditions. Also, the best predictive quality was achieved by a DL approach, simplifying the data workflow as it uses raw 3D images as inputs.
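
The evaluation protocol named in this abstract (5-fold cross-validation scored with RMSE and R²) can be sketched as follows. The sketch substitutes a one-feature least-squares line for the study's models, uses made-up data, and computes R² as one minus the ratio of residual to total sum of squares, which may differ from the definition the authors used. MASE is omitted because the abstract does not say how it was scaled. All names are hypothetical.

```python
import math


def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validated_scores(xs, ys, k=5):
    """Predict every sample from a model fitted without it, then score once."""
    preds = [0.0] * len(xs)
    for train, test in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = slope * xs[i] + intercept
    return rmse(ys, preds), r_squared(ys, preds)


# Made-up data: an image-derived feature and a noisy trait measurement.
feature = [float(i) for i in range(20)]
trait = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(feature)]

cv_rmse, cv_r2 = cross_validated_scores(feature, trait, k=5)
print(f"cross-validated RMSE: {cv_rmse:.3f}, R^2: {cv_r2:.3f}")
```

Scoring the pooled out-of-fold predictions once is one common convention; averaging per-fold scores is another, and the abstract does not say which was used.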


Biomolecules ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1337
Author(s):  
Ruiyang Song ◽  
Baixin Cao ◽  
Zhenling Peng ◽  
Christopher J. Oldfield ◽  
Lukasz Kurgan ◽  
...  

Non-synonymous single nucleotide polymorphisms (nsSNPs) may result in pathogenic changes that are associated with human diseases. Accurate prediction of these deleterious nsSNPs is in high demand. The existing predictors of deleterious nsSNPs secure modest levels of predictive performance, leaving room for improvements. We propose a new sequence-based predictor, DMBS, which addresses the need to improve the predictive quality. The design of DMBS relies on the observation that the deleterious mutations are likely to occur at the highly conserved and functionally important positions in the protein sequence. Correspondingly, we introduce two innovative components. First, we improve the estimates of the conservation computed from the multiple sequence profiles based on two complementary databases and two complementary alignment algorithms. Second, we utilize putative annotations of functional/binding residues produced by two state-of-the-art sequence-based methods. These inputs are processed by a random forests model that provides favorable predictive performance when empirically compared against five other machine-learning algorithms. Empirical results on four benchmark datasets reveal that DMBS achieves AUC > 0.94, outperforming current methods, including protein structure-based approaches. In particular, DMBS secures AUC = 0.97 for the SNPdbe and ExoVar datasets, compared to AUC = 0.70 and 0.88, respectively, that were obtained by the best available methods. Further tests on the independent HumVar dataset show that our method significantly outperforms the state-of-the-art method SNPdryad. We conclude that DMBS provides accurate predictions that can effectively guide wet-lab experiments in a high-throughput manner.


Biomolecules ◽  
2020 ◽  
Vol 10 (12) ◽  
pp. 1636
Author(s):  
Akila Katuwawala ◽  
Lukasz Kurgan

With over 60 disorder predictors, users need help navigating the predictor selection task. We review 28 surveys of disorder predictors, showing that only 11 include assessment of predictive performance. We identify and address a few drawbacks of these past surveys. To this end, we release a novel benchmark dataset with reduced similarity to the training sets of the considered predictors. We use this dataset to perform a first-of-its-kind comparative analysis that targets two large functional families of disordered proteins that interact with proteins and with nucleic acids. We show that limiting sequence similarity between the benchmark and the training datasets has a substantial impact on predictive performance. We also demonstrate that predictive quality is sensitive to the use of the well-annotated order and inclusion of the fully structured proteins in the benchmark datasets, both of which should be considered in future assessments. We identify three predictors that provide favorable results using the new benchmark set. While we find that VSL2B offers the most accurate and robust results overall, ESpritz-DisProt and SPOT-Disorder perform particularly well for disordered proteins. Moreover, we find that predictions for the disordered protein-binding proteins suffer low predictive quality compared to generic disordered proteins and the disordered nucleic acids-binding proteins. This can be explained by the high disorder content of the disordered protein-binding proteins, which makes it difficult for the current methods to accurately identify ordered regions in these proteins. This finding motivates the development of a new generation of methods that would target these difficult-to-predict disordered proteins. We also discuss resources that support users in collecting and identifying high-quality disorder predictions.


2006 ◽  
Vol 3 (2) ◽  
pp. 230-246 ◽  
Author(s):  
Fiona Browne ◽  
Haiying Wang ◽  
Huiru Zheng ◽  
Francisco Azuaje

Abstract Protein-protein interactions (PPI) play a key role in many biological systems. Over the past few years, an explosion in availability of functional biological data obtained from high-throughput technologies to infer PPI has been observed. However, results obtained from such experiments show high rates of false positive and false negative predictions as well as systematic predictive bias. Recent research has revealed that several machine and statistical learning methods applied to integrate relatively weak, diverse sources of large-scale functional data may provide improved predictive accuracy and coverage of PPI. In this paper we describe the effects of applying different computational, integrative methods to predict PPI in Saccharomyces cerevisiae. We investigated the predictive ability of combining different sets of relatively strong and weak predictive datasets. We analysed several genomic datasets ranging from mRNA co-expression to marginal essentiality. Moreover, we expanded an existing multi-source dataset from S. cerevisiae by constructing a new set of putative interactions extracted from Gene Ontology (GO)-driven annotations in the Saccharomyces Genome Database. Different classification techniques were evaluated: Simple Naive Bayesian (SNB), Multilayer Perceptron (MLP) and K-Nearest Neighbors (KNN). Relatively simple classification methods (i.e. less computing intensive and mathematically complex), such as SNB, have been proven to be proficient at predicting PPI. SNB produced the “highest” predictive quality, obtaining an area under the Receiver Operating Characteristic (ROC) curve (AUC) value of 0.99. The lowest AUC value of 0.90 was obtained by the KNN classifier. This assessment also demonstrates the strong predictive power of GO-driven models, which offered predictive performance above 0.90 using the different machine learning and statistical techniques. As the predictive power of single-source datasets became weaker, MLP and SNB performed better than KNN. Moreover, predictive performance saturation may be reached independently of the classification models applied, which may be explained by predictive bias and incompleteness of existing “Gold Standards”. More comprehensive and accurate PPI maps will be produced for S. cerevisiae and beyond with the emergence of large-scale datasets of better predictive quality and the integration of intelligent classification methods.
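
The AUC values quoted in this abstract summarize how well a classifier's scores rank true interactions above non-interactions. A minimal sketch of the computation follows, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive outscores a randomly chosen negative. The labels and scores are made up and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    A positive scored above a negative counts 1, a tie counts 0.5.
    Requires at least one positive (label 1) and one negative (label 0).
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = interacting protein pair, 0 = non-interacting pair.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

print(f"AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; rank-based formulations give the same value more efficiently on large datasets.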


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i735-i744
Author(s):  
Fuhao Zhang ◽  
Wenbo Shi ◽  
Jian Zhang ◽  
Min Zeng ◽  
Min Li ◽  
...  

Abstract Motivation: Knowledge of protein-binding residues (PBRs) improves our understanding of protein-protein interactions, contributes to the prediction of protein functions and facilitates protein-protein docking calculations. While many sequence-based predictors of PBRs have been published, they offer modest levels of predictive performance and most of them cross-predict residues that interact with other partners. One unexplored option to improve the predictive quality is to design consensus predictors that combine results produced by multiple methods. Results: We empirically investigate predictive performance of a representative set of nine predictors of PBRs. We report substantial differences in predictive quality when these methods are used to predict individual proteins, which contrast with the dataset-level benchmarks that are currently used to assess and compare these methods. Our analysis provides new insights for the cross-prediction concern, dissects complementarity between predictors and demonstrates that predictive performance of the top methods depends on unique characteristics of the input protein sequence. Using these insights, we developed PROBselect, a first-of-its-kind consensus predictor of PBRs. Our design is based on the dynamic predictor selection at the protein level, where the selection relies on regression-based models that accurately estimate predictive performance of selected predictors directly from the sequence. Empirical assessment using a low-similarity test dataset shows that PROBselect provides significantly improved predictive quality when compared with the current predictors and conventional consensuses that combine residue-level predictions. Moreover, PROBselect informs the users about the expected predictive quality for the prediction generated from a given input protein. Availability and implementation: PROBselect is available at http://bioinformatics.csu.edu.cn/PROBselect/home/index.
Supplementary information: Supplementary data are available at Bioinformatics online.
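
The idea of dynamic predictor selection at the protein level can be sketched in a few lines: estimate each predictor's quality for the input sequence, then use the predictor with the highest estimate. The sketch below is not PROBselect's implementation. The two quality estimators are hypothetical stand-ins for the regression-based models mentioned in the abstract, and the predictor names and the sequence feature are invented for illustration.

```python
def charged_fraction(sequence):
    """Fraction of residues that are D, E, K or R (an illustrative feature)."""
    return sum(residue in "DEKR" for residue in sequence) / len(sequence)


# Hypothetical stand-ins for per-predictor regression models that map a
# sequence to an estimated predictive quality.
quality_estimators = {
    "predictor_A": lambda seq: 0.6 + 0.3 * charged_fraction(seq),
    "predictor_B": lambda seq: 0.8 - 0.3 * charged_fraction(seq),
}


def select_predictor(sequence, estimators):
    """Return the name and estimated quality of the best-estimated predictor."""
    estimates = {name: estimate(sequence) for name, estimate in estimators.items()}
    best = max(estimates, key=estimates.get)
    return best, estimates[best]


for seq in ("AAAAAAAA", "DEKRDEKR"):
    name, quality = select_predictor(seq, quality_estimators)
    print(f"{seq}: use {name} (estimated quality {quality:.2f})")
```

Returning the estimate together with the selected name reflects the abstract's point that the user is also informed about the expected predictive quality for the given input protein.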

