CV-α: designing validations sets to increase the precision and enable multiple comparison tests in genomic prediction

2020 ◽  
Author(s):  
Rafael Massahiro Yassue ◽  
José Felipe Gonzaga Sabadin ◽  
Giovanni Galli ◽  
Filipe Couto Alves ◽  
Roberto Fritsche-Neto

Abstract Usually, the comparison among genomic prediction models is based on validation schemes such as Repeated Random Subsampling (RRS) or K-fold cross-validation. Nevertheless, the design of training and validation sets strongly affects how, and how subjectively, models are compared. In the procedures cited above, validation sets overlap across replicates, which might cause overestimation and a lack of residual independence due to resampling issues, and might lead to less accurate results. Furthermore, post hoc tests, such as ANOVA, are not recommended because the assumption of residual independence is not fulfilled. Thus, we propose a new way to sample observations to build training and validation sets, based on a cross-validation alpha-based design (CV-α). The CV-α was meant to create several scenarios of validation (replicates × folds), regardless of the number of treatments. Using CV-α, the number of genotypes in the same fold across replicates was much lower than with K-fold, indicating higher residual independence. Therefore, based on the CV-α results, as a proof of concept, via ANOVA, we could compare the proposed methodology to RRS and K-fold, applying four genomic prediction models with a simulated and a real dataset. Concerning the predictive ability and bias, all validation methods showed similar performance. However, regarding the mean squared error and coefficient of variation, the CV-α method presented the best performance under the evaluated scenarios. Moreover, as it has no additional cost or complexity, it is more reliable and allows the use of non-subjective methods to compare models and factors. Therefore, CV-α can be considered a more precise validation methodology for model selection.
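The overlap the abstract refers to (genotypes landing in the same fold in more than one replicate) can be measured directly. The sketch below is not the authors' CV-α design; it only builds repeated random K-fold assignments and counts the genotype pairs that share a fold in more than one replicate. The function names, the 100 genotypes, the 5 folds and the 4 replicates are all illustrative assumptions.

```python
import random
from itertools import combinations


def random_kfold(n_items, k, rng):
    """Randomly assign items 0..n_items-1 to k folds of near-equal size."""
    order = list(range(n_items))
    rng.shuffle(order)
    folds = [0] * n_items
    for position, item in enumerate(order):
        folds[item] = position % k
    return folds


def repeated_pair_count(assignments):
    """Count item pairs that share a fold in more than one replicate."""
    n_items = len(assignments[0])
    repeated = 0
    for a, b in combinations(range(n_items), 2):
        together = sum(1 for folds in assignments if folds[a] == folds[b])
        if together > 1:
            repeated += 1
    return repeated


# Hypothetical setting: 100 genotypes, 5 folds, 4 independent replicates.
rng = random.Random(0)
replicates = [random_kfold(100, 5, rng) for _ in range(4)]
print("pairs sharing a fold in more than one replicate:",
      repeated_pair_count(replicates))
```

A design that controls co-membership across replicates, which is what the abstract claims for CV-α, would be expected to drive this count down relative to independent random K-fold replicates.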

2009 ◽  
Vol 24 (5) ◽  
pp. 1401-1415 ◽  
Author(s):  
Elizabeth E. Ebert ◽  
William A. Gallus

Abstract The contiguous rain area (CRA) method for spatial forecast verification is a features-based approach that evaluates the properties of forecast rain systems, namely, their location, size, intensity, and finescale pattern. It is one of many recently developed spatial verification approaches that are being evaluated as part of a Spatial Forecast Verification Methods Intercomparison Project. To better understand the strengths and weaknesses of the CRA method, it has been tested here on a set of idealized geometric and perturbed forecasts with known errors, as well as nine precipitation forecasts from three high-resolution numerical weather prediction models. The CRA method was able to identify the known errors for the geometric forecasts, but only after a modification was introduced to allow nonoverlapping forecast and observed features to be matched. For the perturbed cases in which a radar rain field was spatially translated and amplified to simulate forecast errors, the CRA method also reproduced the known errors except when a high-intensity threshold was used to define the CRA (≥10 mm h−1) and a large translation error was imposed (>200 km). The decomposition of total error into displacement, volume, and pattern components reflected the source of the error almost all of the time when a mean squared error formulation was used, but not necessarily when a correlation-based formulation was used. When applied to real forecasts, the CRA method gave similar results when either best-fit criteria, minimization of the mean squared error, or maximization of the correlation coefficient, was chosen for matching forecast and observed features. The diagnosed displacement error was somewhat sensitive to the choice of search distance. 
Of the many diagnostics produced by this method, the errors in the mean and peak rain rate between the forecast and observed features showed the best correspondence with subjective evaluations of the forecasts, while the spatial correlation coefficient (after matching) did not reflect the subjective judgments.
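The decomposition of total error into displacement, volume, and pattern components can be sketched on a toy one-dimensional field. This assumes the mean-squared-error formulation usually associated with the CRA method: displacement error is the total MSE minus the MSE after the best-fit shift, volume error is the squared difference of the mean values after the shift, and pattern error is the remainder. The zero-padded shift, the integer search range and the toy values are simplifications introduced here, not details from the abstract.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def shift(field, offset):
    """Shift a 1-D field by `offset` cells, padding with zeros."""
    n = len(field)
    out = [0.0] * n
    for i, value in enumerate(field):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out


def cra_decomposition(forecast, observed, max_shift):
    """Split total MSE into displacement, volume and pattern parts."""
    total = mse(forecast, observed)
    # Best-fit criterion: the shift that minimizes the MSE.
    best_offset = min(range(-max_shift, max_shift + 1),
                      key=lambda o: mse(shift(forecast, o), observed))
    shifted = shift(forecast, best_offset)
    shifted_mse = mse(shifted, observed)
    mean_forecast = sum(shifted) / len(shifted)
    mean_observed = sum(observed) / len(observed)
    volume = (mean_forecast - mean_observed) ** 2
    return {
        "offset": best_offset,
        "total": total,
        "displacement": total - shifted_mse,
        "volume": volume,
        "pattern": shifted_mse - volume,
    }


# Toy fields: the forecast is the observation moved 2 cells and doubled.
observed = [0, 0, 1, 3, 1, 0, 0, 0]
forecast = [0, 0, 0, 0, 2, 6, 2, 0]
result = cra_decomposition(forecast, observed, max_shift=3)
print(result)
```

By construction the three components sum to the total MSE, so the share of each component shows which kind of error dominates.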


2021 ◽  
Vol 13 (12) ◽  
pp. 2380
Author(s):  
Antonio-Juan Collados-Lara ◽  
Eulogio Pardo-Igúzquiza ◽  
David Pulido-Velazquez ◽  
Leticia Baena-Ruiz

Satellites produce valuable information for studying the surface water in wetlands, but in many cases the period covered, the spatial resolution and/or the revisit frequency are not sufficient to produce long historical series. In this paper we propose a novel method which uses regression models that include climatic and hydrological variables to complete the satellite information. We applied this method to the Lagunas de Ruidera wetland (Spain) and approximated the monthly dynamics of the surface water over a long period (1984–2015). Information from the LANDSAT (30-m resolution) and MODIS (250-m resolution) satellites was tested but, due to the size of some lagoons, only the LANDSAT approach produced satisfactory results. An ensemble of regression models based on hydro-climatological explanatory variables was defined to fill the gaps in the monthly surface water series. It showed a root mean squared error of around 476 pixels (0.4 km²) in the cross-validation analysis. Our analysis showed that the explanatory variables contributing most to the regression ensemble are the aquifer discharge, the effective precipitation and the surface water from the previous month. From January to June, the mean surface water in Lagunas de Ruidera is around 4.3 km². In summer a reduction of around 13% of the surface water can be observed, which is recovered during the autumn.
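The pixel-to-area conversion quoted for the root mean squared error can be checked with a short calculation. It assumes square LANDSAT pixels with a 30 m side, as implied by the stated 30-m resolution; the variable names are illustrative.

```python
PIXEL_SIDE_M = 30.0       # LANDSAT resolution stated in the abstract
SQUARE_M_PER_KM2 = 1e6    # 1 km² = 1,000,000 m²

rmse_pixels = 476
rmse_km2 = rmse_pixels * PIXEL_SIDE_M ** 2 / SQUARE_M_PER_KM2
print(f"{rmse_pixels} pixels = {rmse_km2:.4f} km²")
```

The result rounds to the 0.4 km² reported, roughly a tenth of the 4.3 km² mean surface water given for January to June.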


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 113 ◽  
Author(s):  
Marcel Baltruschat ◽  
Paul Czodrowski

We present a small molecule pKa prediction tool written entirely in Python. It predicts the macroscopic pKa value and is trained on a literature compilation of monoprotic compounds. Different machine learning models were tested and random forest performed best in a five-fold cross-validation (mean absolute error = 0.682, root mean squared error = 1.032, correlation coefficient r² = 0.82). We test our model on two external validation sets, where it performs comparably to Marvin and better than a recently published open source model. Our Python tool and all data are freely available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.
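The three figures reported for the five-fold cross-validation can be computed as in the sketch below. It is not taken from the authors' repository: the fold assignment by index is a simplification, r² is assumed here to mean the squared Pearson correlation (the abstract does not say which definition was used), and the pKa values are made-up toy numbers.

```python
import math


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists; sample i goes to test fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up toy values standing in for pooled out-of-fold predictions.
experimental = [4.8, 9.2, 3.1, 10.5]
predicted = [5.0, 8.7, 3.6, 10.1]
print("MAE :", mean_absolute_error(experimental, predicted))
print("RMSE:", root_mean_squared_error(experimental, predicted))
print("r2  :", squared_correlation(experimental, predicted))
```

In a real run, each model would be fitted on the `train` indices of every fold and the metrics computed on the pooled predictions for the `test` indices.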


2020 ◽  
Vol 3 (1) ◽  
pp. 1
Author(s):  
Geraldo Magela da Cruz Pereira ◽  
Andrew de Paula Ribeiro ◽  
Sebastião Martins Filho

This paper aims to evaluate the use of the BLASSO and BayesCπ methods for the genomic prediction of ordinal traits, studying the factors that influence the performance of the models and whether there is a difference in the ranking of individuals. Genotypic and phenotypic information from a simulated population of 4,100 animals, genotyped with 10k markers (QTL-MAS Workshop), was used. A total of 3,000 animals were used to estimate the predictive ability and bias, assessed through 5-fold cross-validation with five repetitions. The remaining animals were used as the selection population. An ANOVA and the Ryan-Einot-Gabriel-Welch test were performed to verify, respectively, which factors significantly influence the genomic prediction and whether there is a statistical difference between the models. The results show that the four main factors significantly (p < 0.05) affect the predictive ability of GEBVs (genomic estimated breeding values), and that heritability and the number of categories are the most influential factors. Significant differences (p < 0.05) between the predictive ability of the methods were observed only for ordinal trait 2, with a density of 9k. In general, the BayesCπ method proved to be more efficient in the identification of relevant SNPs and in the ranking of individuals. Finally, there is a slight superiority of the BayesCπ method for the genomic prediction of ordinal traits.


Agriculture ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 932
Author(s):  
Reyna Persa ◽  
Martin Grondona ◽  
Diego Jarquin

The growing global population faces challenges in satisfying the food supply chain in a world where rapid changes in environmental conditions complicate the development of stable cultivars. Emergent methodologies aided by molecular marker information, such as marker assisted selection (MAS) and genomic selection (GS), have been widely adopted to assist the development of improved genotypes. In general, the implementation of GS is not straightforward, and it usually requires cross-validation studies to find the optimum set of factors (training set sizes, number of markers, quality controls, etc.) to use in real breeding applications. In most cases, these different scenarios (combinations of several factors) vary only in the levels of a single factor while the levels of the other factors are kept fixed, allowing the use of previously developed routines (code reuse). In this study we present a set of structured modules that are easy to assemble for constructing complex genomic prediction pipelines from scratch. We also propose a novel method for selecting training-testing sets of similar sample sizes across different cross-validation schemes (CV2, predicting tested genotypes in observed environments; CV1, predicting untested genotypes in observed environments; CV0, predicting tested genotypes in novel environments; and CV00, predicting untested genotypes in novel environments). To show how our implementation works, we considered two real data sets. These correspond to selected samples of the USDA soybean collection (D1: 324 genotypes observed in 6 environments scored for 9 traits) and of the Soybean Nested Association Mapping (SoyNAM) experiment (D2: 324 genotypes observed in 6 environments scored for 6 traits).
In addition, three prediction models were considered: one with the effects of environments and lines (M1: E + L); one with environments, lines and the main effect of markers (M2: E + L + G); and one that also includes the interaction between markers and environments (M3: E + L + G + G×E). The results confirm that under the CV2 and CV1 schemes, moderate improvements in predictive ability can be obtained with the inclusion of the interaction component, while for CV0 mixed results were observed, and for CV00 no improvements were shown. However, for this last scenario the inclusion of weather and soil data could potentially enhance the results of the interaction model.
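The four cross-validation schemes differ only in which genotype-by-environment cells are held out. The sketch below is one possible reading of the definitions given in the abstract, not the authors' modules; the function `cv_split`, its parameters and the toy genotype and environment labels are hypothetical.

```python
from itertools import product


def cv_split(cells, scheme, held_genotypes=(), held_envs=(), held_cells=()):
    """Split (genotype, environment) cells into train and test lists."""
    held_g, held_e, held_c = set(held_genotypes), set(held_envs), set(held_cells)
    train, test = [], []
    for cell in cells:
        genotype, env = cell
        if scheme == "CV2":    # tested genotypes, observed environments
            in_test = cell in held_c
            in_train = not in_test
        elif scheme == "CV1":  # untested genotypes, observed environments
            in_test = genotype in held_g
            in_train = not in_test
        elif scheme == "CV0":  # tested genotypes, novel environments
            in_test = env in held_e
            in_train = not in_test
        elif scheme == "CV00":  # untested genotypes, novel environments
            in_test = genotype in held_g and env in held_e
            # Training may use neither the held genotypes nor the held environments.
            in_train = genotype not in held_g and env not in held_e
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if in_test:
            test.append(cell)
        elif in_train:
            train.append(cell)
    return train, test


# Toy layout: 3 genotypes observed in 2 environments (6 cells).
cells = list(product(["G1", "G2", "G3"], ["E1", "E2"]))
train, test = cv_split(cells, "CV00", held_genotypes=["G3"], held_envs=["E2"])
print("train:", train)
print("test :", test)
```

Under CV00 some cells belong to neither set, which is why matching training-set sizes across schemes, as the abstract proposes, requires deliberate sampling.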


2017 ◽  
Author(s):  
Ashley I. Naimi ◽  
Laura B. Balzer

Abstract Stacked generalization is an ensemble method that allows researchers to combine several different prediction algorithms into one. Since its introduction in the early 1990s, the method has evolved several times into what is now known as “Super Learner”. Super Learner uses V-fold cross-validation to build the optimal weighted combination of predictions from a library of candidate algorithms. Optimality is defined by a user-specified objective function, such as minimizing mean squared error or maximizing the area under the receiver operating characteristic curve. Although the method is relatively simple in nature, its use by epidemiologists has been hampered by a limited understanding of its conceptual and technical details. We work step-by-step through two examples to illustrate concepts and address common concerns.
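The idea of using V-fold cross-validation to weight candidate algorithms can be sketched with two toy learners. This is a minimal illustration under several assumptions, not the Super Learner software: the library holds only a mean predictor and a one-feature least-squares line, folds are assigned by index, the objective is mean squared error, and the convex weight is found by a simple grid search. All function names and the data are hypothetical.

```python
def fit_mean(x, y):
    """Candidate 1: always predict the training mean."""
    mean_y = sum(y) / len(y)
    return lambda x_new: mean_y


def fit_line(x, y):
    """Candidate 2: simple least-squares line on one feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x_new: intercept + slope * x_new


def out_of_fold_predictions(x, y, learners, v):
    """Predict each sample from models fitted without that sample's fold."""
    n = len(x)
    preds = [[0.0] * n for _ in learners]
    for fold in range(v):
        test_idx = [i for i in range(n) if i % v == fold]
        train_idx = [i for i in range(n) if i % v != fold]
        x_train = [x[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for j, fit in enumerate(learners):
            model = fit(x_train, y_train)
            for i in test_idx:
                preds[j][i] = model(x[i])
    return preds


def best_convex_weight(y, pred_a, pred_b, steps=100):
    """Grid-search w in [0, 1] minimizing the MSE of w*a + (1-w)*b."""
    best_w, best_mse = 0.0, float("inf")
    for s in range(steps + 1):
        w = s / steps
        errors = ((yi - (w * a + (1 - w) * b)) ** 2
                  for yi, a, b in zip(y, pred_a, pred_b))
        current = sum(errors) / len(y)
        if current < best_mse:
            best_w, best_mse = w, current
    return best_w, best_mse


# Toy data with an exactly linear relationship.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
preds = out_of_fold_predictions(x, y, [fit_mean, fit_line], v=5)
weight_on_mean, cv_mse = best_convex_weight(y, preds[0], preds[1])
print("weight on mean predictor:", weight_on_mean)
print("weight on line predictor:", 1 - weight_on_mean)
```

Because the weights are chosen on out-of-fold predictions, a candidate that only fits the training data well gains no advantage; swapping the objective inside `best_convex_weight` would correspond to the user-specified objective function mentioned in the abstract.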


2011 ◽  
Vol 60 (2) ◽  
pp. 248-255 ◽  
Author(s):  
Sangmun Shin ◽  
Funda Samanlioglu ◽  
Byung Rae Cho ◽  
Margaret M. Wiecek
