Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example

2021 ◽  
Vol 17 (12) ◽  
pp. e1009682
Author(s):  
Guoyang Zou ◽  
Yang Zou ◽  
Chenglong Ma ◽  
Jiaojiao Zhao ◽  
Lei Li

Many computational classifiers have been developed to predict different types of post-translational modification (PTM) sites. Their performance is usually measured by cross-validation or an independent test, in which experimental data from different sources are mixed and randomly split into training and test sets. However, the self-reported performance of most classifiers under this protocol is generally higher than their performance on new experimental data, which suggests that cross-validation overestimates the generalization ability of a classifier. Here, we propose a generalization estimate, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome (Kme) as an example and developed a deep learning-based Kme site predictor (called DeepKme) with outstanding performance. We assessed the experiment-split test by comparing it with cross-validation and found that the performance measured using the experiment-split test is lower than that measured by cross-validation. Because the test data of the experiment-split method are derived from an independent experimental source, this method can reflect the generalization of the predictor. Therefore, we believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible via https://github.com/guoyangzou/DeepKme.
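The difference between a random split and an experiment-split can be sketched in a few lines. This is only an illustration under stated assumptions, not the DeepKme pipeline: the record fields (`site`, `source`, `label`), the source names (`exp0` to `exp3`), and both helper functions are hypothetical.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Conventional split: records from all sources are mixed, then divided."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def experiment_split(records, test_sources):
    """Experiment-split: every record from a held-out source goes to the test set."""
    train = [r for r in records if r["source"] not in test_sources]
    test = [r for r in records if r["source"] in test_sources]
    return train, test


# Hypothetical toy data: 20 candidate sites reported by 4 experiments.
records = [
    {"site": f"P{i}_K{10 + i}", "source": f"exp{i % 4}", "label": i % 2}
    for i in range(20)
]

train, test = experiment_split(records, test_sources={"exp3"})
train_sources = {r["source"] for r in train}
held_out_sources = {r["source"] for r in test}

mixed_train, mixed_test = random_split(records)
```

With the experiment-split, no experimental source contributes to both sets, so the test set behaves like data from an unseen experiment. The random split gives no such guarantee.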

2021 ◽  
Author(s):  
Guoyang Zou ◽  
Lei Li

A large number of predictors have been built on different data sets to predict different post-translational modification sites. However, to the best of our knowledge, most of them give an overfitted (overly optimistic) estimate of their generalization ability on new data because of an intrinsic trait of the cross-validation method: it does not consider the experimental sources of the new data. Thus, we proposed and explored a new method, the experiment-split method, which imitates a blinded assessment to deal with the overfitting problem on new data. The experiment-split method splits the training and test data according to their experimental sources, so that new data can be regarded as data from different experimental sources. To illustrate the experiment-split method concretely, we applied it to DeepKme, a predictor we built for lysine methylation sites, to demonstrate how it can be used in real scenarios. We compared the cross-validation method with the experiment-split method. The results suggest that the experiment-split method can effectively relieve overfitting compared with the cross-validation method and may be widely applicable to identification tasks that draw on multiple experiments. We believe DeepKme will encourage researchers to think more deeply about the experiment-split method and the overfitting phenomenon and will advance the study of lysine methylation and similar fields.


2020 ◽  
Vol 17 ◽  
Author(s):  
Hongwei Liu ◽  
Bin Hu ◽  
Lei Chen ◽  
Lin Lu

Background: Identification of protein subcellular location is an important problem because subcellular location is closely related to protein function. Determining locations through biological experiments is fundamental, but such experiments are costly and time-consuming. The alternative is to design effective computational methods. Objective: To date, several computational methods have been proposed for this purpose. However, these methods mainly adopt features derived from the proteins themselves. Meanwhile, with the development of network techniques, several embedding algorithms have been proposed that encode the nodes of a network into feature vectors. Such algorithms connect networks with traditional classification algorithms and thus provide a new way to construct models for predicting protein subcellular location. Method: In this study, we analyzed features produced by three network embedding algorithms (DeepWalk, Node2vec and Mashup) applied to one or multiple protein networks. The resulting features were learned by a machine learning algorithm (support vector machine or random forest) to construct each model. The cross-validation method was adopted to evaluate all constructed models. Results: Evaluation with the cross-validation method showed that embedding features yielded by Mashup on multiple networks were quite informative for predicting protein subcellular location. The model based on these features was superior to some classic models. Conclusion: Embedding features yielded by a suitable and powerful network embedding algorithm were effective for building a model to predict protein subcellular location, providing new pipelines for building more efficient models.


Author(s):  
Jae Young Lee ◽  
Martin Röösli ◽  
Martina S. Ragettli

This study presents a novel method for estimating heat-attributable fractions (HAF) based on the cross-validated best temperature metric. We analyzed the association of eight temperature metrics (mean, maximum, minimum temperature, maximum temperature during daytime, minimum temperature during nighttime, and mean, maximum, and minimum apparent temperature) with mortality and used the cross-validation method to select the best model in selected cities of Switzerland and South Korea from May to September of 1995–2015. The HAF estimated using different metrics varied by 2.69–4.09% in eight cities of Switzerland and by 0.61–0.90% in six cities of South Korea. Based on the cross-validation method, mean temperature was estimated to be the best metric, giving HAF of 3.29% for Switzerland and 0.72% for South Korea. Furthermore, estimates of HAF were improved by selecting the best city-specific model for each city, that is, 3.34% for Switzerland and 0.78% for South Korea. To the best of our knowledge, this study is the first to observe the uncertainty of HAF estimation originating from the selection of temperature metric and to present HAF estimation based on the cross-validation method.


2015 ◽  
Vol 9 (1) ◽  
pp. 107-114
Author(s):  
Zhou Shengquan ◽  
Zhao Xiaolong ◽  
Yao Zhaoming

In order to forecast the displacement of deep foundation pit support, this paper proposes a new method that combines the cross-validation method with a support vector machine (SVM) based on random small samples. Because random small monitoring data sets are difficult to fit and forecast, the cross-validation method and different kernel functions of the SVM algorithm are repeatedly used to establish and optimize the displacement prediction model of an underground continuous wall, and validation samples are then used to test the accuracy of the models. The results show that this method meets the precision requirements relatively well and that the Cauchy kernel function performs better than the others. In terms of the accuracy of model fitting and prediction, this method has great advantages and can be applied to practical engineering.
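The abstract does not state the form of the Cauchy kernel it uses. A commonly cited definition is k(x, y) = 1 / (1 + ||x - y||² / σ²), and the sketch below assumes that form. The function names and the bandwidth parameter `sigma` are illustrative choices, not taken from the paper.

```python
def cauchy_kernel(x, y, sigma=1.0):
    """Cauchy kernel, assuming the form 1 / (1 + ||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 / (1.0 + sq_dist / sigma ** 2)


def gram_matrix(points, sigma=1.0):
    """Pairwise kernel values, the matrix a kernel SVM solver works with."""
    return [[cauchy_kernel(p, q, sigma) for q in points] for p in points]


# Hypothetical 2-D monitoring features (for example depth and elapsed days).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, sigma=1.0)
```

Compared with a Gaussian kernel, this form decays polynomially rather than exponentially with distance, so distant samples keep some influence on each prediction.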


2017 ◽  
Vol 33 (4) ◽  
pp. 543-549 ◽  
Author(s):  
Bernardo Gomes Nörenberg ◽  
Lessandro Coll Faria ◽  
Osvaldo Rettore Neto ◽  
Samuel Beskow ◽  
Alberto Colombo ◽  
...  

Abstract. In order to develop models representing Christiansen’s Uniformity (CU) and Distribution Uniformity (DU) as a function of wind speed, 32 in-field tests evaluating a mechanical lateral-move irrigation system used in rice production were carried out in southern Rio Grande do Sul, Brazil. These tests were used to generate two third-order polynomial models for estimating CU and DU, which were then validated with a cross-validation approach. The accuracy of the generated models was quantified by means of the following statistical measures: determination coefficient (R²), reliability and performance index (c), root mean square error (RMSE), and Nash-Sutcliffe coefficient (CNS). Wind direction had no significant influence on CU and DU. The CU values estimated from the cross-validation method were compared with those observed, resulting in R² = 0.44, c = 0.53, RMSE = 1.82%, and CNS = 0.43. Likewise, DU values estimated from the cross-validation method were compared with the observed values, yielding R², c, RMSE, and CNS equal to 0.41, 0.51, 2.81%, and 0.40, respectively. The models developed in this study can be useful as a support tool for decision making when applying mechanical lateral-move irrigation systems, allowing estimation of CU and DU values with satisfactory precision for wind speeds less than 5.5 m s⁻¹. Keywords: In-field tests, Rice, Sprinkler irrigation.
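Two of the accuracy measures named above have standard definitions that are easy to state in code. The sketch below computes RMSE and the Nash-Sutcliffe coefficient for paired observed and estimated values. The function names and the toy numbers are hypothetical and are not data from the study.

```python
import math


def rmse(observed, estimated):
    """Root mean square error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)


def nash_sutcliffe(observed, estimated):
    """Nash-Sutcliffe coefficient: 1 minus residual variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


# Hypothetical toy values: every estimate is off by exactly one unit.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [2.0, 3.0, 4.0, 5.0]

error = rmse(observed, estimated)
efficiency = nash_sutcliffe(observed, estimated)
```

A Nash-Sutcliffe value of 1 means a perfect match, while 0 means the model is no better than predicting the mean of the observations.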


2015 ◽  
Vol 9 (1) ◽  
pp. 53-60
Author(s):  
Zhou Shengquan ◽  
Zhao Xiaolong ◽  
Yao Zhaoming

In order to forecast the displacement of deep foundation pit support, this paper proposes a new method that combines the cross-validation method with a support vector machine (SVM) based on random small samples. Because random small monitoring data sets are difficult to fit and forecast, the cross-validation method and different kernel functions of the SVM algorithm are repeatedly used to establish and optimize the displacement prediction model of an underground continuous wall, and validation samples are then used to test the accuracy of the models. The results show that this method meets the precision requirements relatively well and that the Cauchy kernel function performs better than the others. In terms of the accuracy of model fitting and prediction, this method has great advantages and can be applied to practical engineering.


Author(s):  
V. M. Nedel’ko ◽  

In this work we will study the accuracy of the cross-validation estimates for decision functions. The main idea of the research consists in the scheme of statistical modeling that allows using real data to obtain statistical estimates, which are usually obtained only by using model (synthetic) distributions. The studies confirm the well-known empirical recommendation to choose the number of folds equal to 5 or more. The choice of more than 10 folds does not yield a significant increase in accuracy. The use of repeated cross-validation also does not provide fundamental gain in precision. The results of the experiments allow us to formulate an empirical fact that the accuracy of the estimates obtained by the cross-validation method is approximately the same as the accuracy of the estimates obtained from the test sample of half the size. This result can be easily explained by the fact that all the objects of the test sample are independent, and the estimates built by the cross-validation on different subsamples (folds) are not independent.
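For readers unfamiliar with the procedure whose accuracy is studied here, k-fold cross-validation can be sketched as follows. This is a minimal illustration with a toy nearest-class-mean classifier on one feature. The helper names, the classifier, and the data are assumptions made for the example and are not part of the study.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_val_accuracy(xs, ys, n_folds=5, seed=0):
    """Each fold is held out once; accuracy is pooled over all held-out samples."""
    folds = kfold_indices(len(xs), n_folds, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit: one mean per class, using training samples only.
        means = {}
        for label in {ys[i] for i in train_idx}:
            values = [xs[i] for i in train_idx if ys[i] == label]
            means[label] = sum(values) / len(values)
        # Predict: assign each held-out sample to the nearest class mean.
        for i in test_idx:
            predicted = min(means, key=lambda c: abs(xs[i] - means[c]))
            correct += predicted == ys[i]
    return correct / len(xs)


# Hypothetical, well-separated toy data: class 0 near 0, class 1 near 10.
xs = [i * 0.1 for i in range(10)] + [10 + i * 0.1 for i in range(10)]
ys = [0] * 10 + [1] * 10

folds = kfold_indices(len(xs), 5)
accuracy = cross_val_accuracy(xs, ys, n_folds=5)
```

Because the training sets of different folds overlap heavily, the per-fold estimates are not independent, which is the property the abstract uses to explain its result.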

