Efficiency of genomic prediction of non-assessed single crosses

2017
Author(s):
José Marcelo Soriano Viana
Helcio Duarte Pereira
Gabriel Borges Mundim
Hans-Peter Piepho
Fabyano Fonseca e Silva

ABSTRACT: An important application of genomic selection in plant breeding is the prediction of untested single crosses (SCs). Most investigations on prediction efficiency have been based on tested SCs, using cross-validation. The main objective was to assess the prediction efficiency by correlating the predicted and true genotypic values of untested SCs (accuracy) and by measuring the efficacy of identification of the best 300 untested SCs (coincidence), using simulated data. We assumed 10,000 SNPs, 400 QTLs, two groups of 70 selected DH lines, and 4,900 SCs. The heritabilities for the assessed SCs were 30, 60, and 100%. The scenarios included three sampling processes of DH lines, two sampling processes of SCs for testing, two SNP densities, DH lines from distinct and from the same populations, DH lines from populations with lower LD, two genetic models, three statistical models, and three statistical approaches. We derived a model for genomic prediction based on SNP average effects of substitution and dominance deviations. The prediction accuracy is not affected by the linkage phase. The prediction of untested SCs is very efficient. The accuracies and coincidences ranged from approximately 0.8 and 0.5, respectively, under low heritability, to 0.9 and 0.7 under high heritability. Additionally, we highlighted the relevance of the overall LD and showed that efficient prediction of untested SCs can be achieved for crops that show no heterotic pattern, for a reduced training set size (10%), for a SNP density of 1 cM, and for distinct sampling processes of DH lines, based on random choice of the SCs for testing.
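The two evaluation metrics used here, accuracy (the correlation between predicted and true genotypic values of the untested SCs) and coincidence (the overlap among the best 300 untested SCs), are straightforward to compute. The following Python sketch illustrates them on toy values; the function names and simulated numbers are illustrative, not taken from the paper.

```python
import numpy as np

def prediction_accuracy(true_gv, pred_gv):
    """Pearson correlation between true and predicted genotypic values
    of the untested single crosses (the 'accuracy')."""
    return np.corrcoef(true_gv, pred_gv)[0, 1]

def coincidence(true_gv, pred_gv, top_n=300):
    """Fraction of the best `top_n` single crosses (by true genotypic value)
    that are also ranked in the top `top_n` by prediction (the 'coincidence')."""
    best_true = set(np.argsort(true_gv)[::-1][:top_n])
    best_pred = set(np.argsort(pred_gv)[::-1][:top_n])
    return len(best_true & best_pred) / top_n

# Toy usage with simulated values for 4,900 untested crosses
rng = np.random.default_rng(1)
true_gv = rng.normal(size=4900)
pred_gv = 0.8 * true_gv + rng.normal(scale=0.6, size=4900)  # imperfect predictor
print(prediction_accuracy(true_gv, pred_gv), coincidence(true_gv, pred_gv))
```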


PLoS ONE
2013
Vol 8 (12)
pp. e81046
Author(s):
Malena Erbe
Birgit Gredler
Franz Reinhold Seefried
Beat Bapst
Henner Simianer


2021
Vol 12
Author(s):
Bader Arouisse
Tom P. J. M. Theeuwen
Fred A. van Eeuwijk
Willem Kruijer

In the past decades, genomic prediction has had a large impact on plant breeding. Given the current advances of high-throughput phenotyping and sequencing technologies, it is increasingly common to observe a large number of traits in addition to the target trait of interest. This raises the important question of whether these additional or “secondary” traits can be used to improve genomic prediction for the target trait. With only a small number of secondary traits, this is known to be the case, given sufficiently high heritabilities and genetic correlations. Here we focus on the more challenging situation with a large number of secondary traits, which is increasingly common since the arrival of high-throughput phenotyping. In this case, secondary traits are usually incorporated through additional relatedness matrices. This approach is, however, infeasible when secondary traits are not measured on the test set, and it cannot distinguish between genetic and non-genetic correlations. An alternative direction is to extend the classical selection indices using penalized regression. So far, penalized selection indices have not been applied in a genomic prediction setting, and they require plot-level data in order to reliably estimate genetic correlations. Here we aim to overcome these limitations using two novel approaches. Our first approach relies on a dimension reduction of the secondary traits, using either penalized regression or random forests (LS-BLUP/RF-BLUP); we then compute the bivariate GBLUP with the dimension reduction as the secondary trait. For simulated data (with available plot-level data), we also use bivariate GBLUP with the penalized selection index as the secondary trait (SI-BLUP). In our second approach (GM-BLUP), we follow existing multi-kernel methods but replace secondary traits by their genomic predictions, with the advantage that genomic prediction is also possible when secondary traits are measured only on the training set. For most of our simulated data, SI-BLUP was most accurate, often closely followed by RF-BLUP or LS-BLUP. In real datasets, involving metabolites in Arabidopsis and transcriptomics in maize, no method could substantially improve over univariate prediction when secondary traits were available only on the training set. LS-BLUP and RF-BLUP were most accurate when secondary traits were also available for the test set.
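The dimension-reduction step behind LS-BLUP and RF-BLUP can be illustrated with a small Python sketch. The data below are toy stand-ins, and the subsequent bivariate GBLUP fit, which the paper performs with dedicated mixed-model software, is only indicated in a comment.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

# Toy dimensions: training genotypes with many secondary traits and one target trait
rng = np.random.default_rng(0)
n_train, n_sec = 200, 500
S_train = rng.normal(size=(n_train, n_sec))                       # secondary traits (training set)
y_train = S_train[:, :5].sum(axis=1) + rng.normal(size=n_train)   # target trait
S_test = rng.normal(size=(100, n_sec))                            # secondary traits (test set, if measured)

# LS reduction: penalized regression of the target on the secondary traits
ls = LassoCV(cv=5).fit(S_train, y_train)
ls_index_train, ls_index_test = ls.predict(S_train), ls.predict(S_test)

# RF reduction: the same idea with a random forest
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(S_train, y_train)
rf_index_train, rf_index_test = rf.predict(S_train), rf.predict(S_test)

# Either one-dimensional index would then enter a bivariate GBLUP as the single secondary trait.
```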



2011
Author(s):
Jeffrey S. Katz
John F. Magnotti
Anthony A. Wright


2021
Vol 13 (3)
pp. 368
Author(s):
Christopher A. Ramezan
Timothy A. Warner
Aaron E. Maxwell
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when the training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes; NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
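The core experiment, fitting each classifier on progressively smaller training samples and scoring overall accuracy on a fixed test set, can be sketched as follows. Synthetic features stand in for the GEOBIA object attributes, and only RF and SVM are shown; the sample sizes mirror those reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for object-based features (spectral means, geometry, texture)
X, y = make_classification(n_samples=12000, n_features=30, n_informative=15,
                           n_classes=5, n_clusters_per_class=2, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000, random_state=0)

for n_train in (10000, 1000, 315, 40):
    idx = np.random.default_rng(0).choice(len(X_pool), size=n_train, replace=False)
    for name, clf in (("RF", RandomForestClassifier(n_estimators=500, random_state=0)),
                      ("SVM", SVC(kernel="rbf", C=10, gamma="scale"))):
        clf.fit(X_pool[idx], y_pool[idx])
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"n_train={n_train:5d}  {name}: overall accuracy = {acc:.3f}")
```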



2021
Vol 12
Author(s):
Marlee R. Labroo
Jauhar Ali
M. Umair Aslam
Erik Jon de Asis
Madonna A. dela Paz
...

Hybrid rice varieties can outyield the best inbred varieties by 15–30% with appropriate management. However, hybrid rice requires more inputs and management than inbred rice to realize a yield advantage in high-yielding environments. The development of stress-tolerant hybrid rice with lowered input requirements could increase hybrid rice yield relative to production costs. We used genomic prediction to evaluate the combining abilities of 564 stress-tolerant lines used to develop Green Super Rice with 13 male sterile lines of the International Rice Research Institute for yield-related traits. We also evaluated the performance of their F1 hybrids. We identified male sterile lines with good combining ability as well as F1 hybrids with potential for further use in product development. For yield per plant, accuracies of genomic predictions of hybrid genetic values ranged from 0.490 to 0.822 in cross-validation, depending on whether neither, one, or both parents were included in the training set, when both general and specific combining abilities were modeled. The accuracy of phenotypic selection for hybrid yield per plant was 0.682. The accuracy of genomic predictions of male GCA for yield per plant was 0.241, while the accuracy of phenotypic selection was 0.562. At the observed accuracies, genomic prediction of hybrid genetic value could allow improved identification of high-performing single crosses. In a reciprocal recurrent genomic selection program with an accelerated breeding cycle, the observed male GCA genomic prediction accuracies would lead to rates of genetic gain similar to those of phenotypic selection. It is likely that prediction accuracies of male GCA could be improved further by targeted expansion of the training set. Additionally, we tested the correlation of parental genetic distance with mid-parent heterosis in the phenotyped hybrids. We found the average mid-parent heterosis for yield per plant to be 32.0%, consistent with existing literature values. In the overall study population, parental genetic distance was significantly negatively correlated with mid-parent heterosis for yield per plant (r = −0.131) and potential yield (r = −0.092), but within female families the correlations were non-significant and near zero. As such, larger parental genetic distance was not reliably associated with greater mid-parent heterosis.
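The heterosis analysis at the end of the abstract reduces to two quantities per hybrid, mid-parent heterosis and a genetic distance between its parents, whose correlation can be computed as in this illustrative Python sketch. The toy data and the particular distance measure are assumptions for illustration, not necessarily those used in the study.

```python
import numpy as np

def mid_parent_heterosis(f1, p1, p2):
    """Mid-parent heterosis (%) for each hybrid: 100 * (F1 - MP) / MP,
    where MP is the mean of the two parental values."""
    mp = (p1 + p2) / 2.0
    return 100.0 * (f1 - mp) / mp

def parental_distance(geno1, geno2):
    """Simple modified-Rogers-style distance between two parental
    genotype vectors coded 0/1/2 (illustrative choice of distance)."""
    return np.sqrt(np.mean(((geno1 - geno2) / 2.0) ** 2))

# Toy example for a handful of hybrids
rng = np.random.default_rng(2)
p1_geno, p2_geno = rng.integers(0, 3, (20, 1000)), rng.integers(0, 3, (20, 1000))
p1_yield, p2_yield = rng.normal(30, 3, 20), rng.normal(28, 3, 20)
f1_yield = (p1_yield + p2_yield) / 2 * rng.normal(1.3, 0.1, 20)   # roughly 30% heterosis

mph = mid_parent_heterosis(f1_yield, p1_yield, p2_yield)
dist = np.array([parental_distance(a, b) for a, b in zip(p1_geno, p2_geno)])
print(np.corrcoef(dist, mph)[0, 1])   # correlation of parental distance with heterosis
```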



2020
Author(s):
Fanny Mollandin
Andrea Rau
Pascal Croiseau

ABSTRACT: Technological advances and decreasing costs have led to the rise of increasingly dense genotyping data, making feasible the identification of potential causal markers. Custom genotyping chips, which combine medium-density genotypes with a custom genotype panel, can capitalize on these candidates to potentially yield improved accuracy and interpretability in genomic prediction. A particularly promising model to this end is BayesR, which divides markers into four effect size classes. BayesR has been shown to yield accurate predictions and promise for quantitative trait loci (QTL) mapping in real data applications, but an extensive benchmarking in simulated data is currently lacking. Based on a set of real genotypes, we generated simulated data under a variety of genetic architectures and phenotype heritabilities, and we evaluated the impact of excluding or including causal markers among the genotypes. We define several statistical criteria for QTL mapping, including some based on sliding windows to account for linkage disequilibrium, and we compare and contrast these statistics and their ability to accurately prioritize known causal markers. Overall, we confirm the strong predictive performance of BayesR for moderately to highly heritable traits, particularly for 50k custom data. In cases of low heritability or weak linkage disequilibrium with the causal marker in 50k genotypes, QTL mapping is a challenge, regardless of the criterion used. BayesR is a promising approach to simultaneously obtain accurate predictions and interpretable classifications of SNPs into effect size classes. We illustrate the performance of BayesR in a variety of simulation scenarios and compare the advantages and limitations of each.
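One family of sliding-window criteria of the kind mentioned above can be expressed as a window posterior inclusion probability computed from per-marker inclusion indicators across MCMC iterations. The sketch below is a generic illustration under assumed inputs, not the paper's exact statistic.

```python
import numpy as np

def window_inclusion_probability(inclusion_draws, positions, window_cm=1.0):
    """Per-window posterior probability that at least one marker in the
    window has a non-zero effect, from MCMC inclusion indicators.

    inclusion_draws : (n_iter, n_markers) boolean array, True if the marker
                      was assigned a non-zero effect class at that iteration.
    positions       : (n_markers,) marker positions in cM.
    """
    starts = np.arange(positions.min(), positions.max(), window_cm)
    probs = []
    for s in starts:
        in_win = (positions >= s) & (positions < s + window_cm)
        if not in_win.any():
            probs.append(0.0)
            continue
        # At each iteration, is any marker in the window included?
        any_included = inclusion_draws[:, in_win].any(axis=1)
        probs.append(any_included.mean())
    return starts, np.array(probs)

# Toy usage: 2,000 iterations over 500 markers spread across 100 cM
rng = np.random.default_rng(3)
positions = np.sort(rng.uniform(0, 100, 500))
draws = rng.random((2000, 500)) < 0.01          # sparse, random inclusions
starts, wip = window_inclusion_probability(draws, positions)
print(wip.max(), starts[wip.argmax()])
```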



2020
Author(s):
Jenke Scheen
Wilson Wu
Antonia S. J. S. Mey
Paolo Tosco
Mark Mackey
...

A methodology that combines alchemical free energy calculations (FEP) with machine learning (ML) has been developed to compute accurate absolute hydration free energies. The hybrid FEP/ML methodology was trained on a subset of the FreeSolv database, and retrospectively shown to outperform most submissions from the SAMPL4 competition. Compared to pure machine-learning approaches, FEP/ML yields more precise estimates of free energies of hydration, and requires a fraction of the training set size to outperform standalone FEP calculations. The ML-derived correction terms are further shown to be transferable to a range of related FEP simulation protocols. The approach may be used to inexpensively improve the accuracy of FEP calculations, and to flag molecules which will benefit the most from bespoke forcefield parameterisation efforts.
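The central idea of learning a correction term on top of FEP estimates can be sketched generically as follows. The descriptors and energies are toy stand-ins; this illustrates the hybrid concept, not the authors' trained model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Toy stand-ins: FEP-estimated hydration free energies, reference values,
# and simple per-molecule descriptors (in practice, e.g. cheminformatics features).
rng = np.random.default_rng(4)
n_mol = 400
features = rng.normal(size=(n_mol, 20))
fep_dg = rng.normal(-5.0, 3.0, n_mol)
systematic_error = 0.5 * features[:, 0] - 0.3 * features[:, 1]
exp_dg = fep_dg + systematic_error + rng.normal(0, 0.3, n_mol)

X_tr, X_te, fep_tr, fep_te, exp_tr, exp_te = train_test_split(
    features, fep_dg, exp_dg, test_size=0.25, random_state=0)

# Learn a correction term: the offset between reference and FEP estimates
corr_model = GradientBoostingRegressor(random_state=0).fit(X_tr, exp_tr - fep_tr)
corrected = fep_te + corr_model.predict(X_te)

print("RMSE, raw FEP:      ", np.sqrt(np.mean((fep_te - exp_te) ** 2)))
print("RMSE, FEP + ML corr:", np.sqrt(np.mean((corrected - exp_te) ** 2)))
```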



Author(s):
André Maletzke
Waqar Hassan
Denis dos Reis
Gustavo Batista

Quantification is a task similar to classification in the sense that it learns from a labeled training set. However, quantification is not interested in predicting the class of each observation, but rather in estimating the class distribution in the test set. The community has developed performance measures and experimental setups tailored to quantification tasks. Nonetheless, we argue that a critical variable, the size of the test set, remains ignored. Such disregard has three main detrimental effects. First, it implicitly assumes that quantifiers will perform equally well for different test set sizes. Second, it increases the risk of cherry-picking by selecting a test set size for which a particular proposal performs best. Finally, it disregards the importance of designing methods that are suitable for different test set sizes. We discuss these issues with the support of one of the broadest experimental evaluations ever performed, with three main outcomes. (i) We empirically demonstrate the importance of the test set size when assessing quantifiers. (ii) We show that current quantifiers generally have mediocre performance on the smallest test sets. (iii) We propose a metalearning scheme to select the best quantifier based on the test set size, which can outperform the best single quantification method.
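To make the test-set-size variable concrete, the sketch below evaluates two standard quantifiers, classify-and-count and its adjusted variant, on test samples of different sizes. The binary data are toy examples, and in practice the tpr/fpr estimates would come from cross-validation rather than the training fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Classify-and-count (CC) and adjusted classify-and-count (ACC) quantifiers
X, y = make_classification(n_samples=6000, n_features=10, random_state=0)
X_tr, y_tr, X_rest, y_rest = X[:2000], y[:2000], X[2000:], y[2000:]

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Estimate tpr/fpr for the ACC correction (here naively on the training data)
pred_tr = clf.predict(X_tr)
tpr = pred_tr[y_tr == 1].mean()
fpr = pred_tr[y_tr == 0].mean()

for test_size in (10, 100, 1000):
    idx = np.random.default_rng(0).choice(len(X_rest), test_size, replace=False)
    cc = clf.predict(X_rest[idx]).mean()                    # classify and count
    acc_q = np.clip((cc - fpr) / (tpr - fpr), 0, 1)         # adjusted count
    true_prev = y_rest[idx].mean()
    print(f"n_test={test_size:4d}  true={true_prev:.3f}  CC={cc:.3f}  ACC={acc_q:.3f}")
```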



2021
Vol 12
Author(s):
Zigui Wang
Hao Cheng

Genomic prediction has been widely used in multiple areas, and various genomic prediction methods have been developed. The majority of these methods, however, focus on statistical properties and ignore abundant and useful biological information, such as genome annotation or previously discovered causal variants. Therefore, to improve prediction performance, several methods have been developed to incorporate biological information into genomic prediction, mostly in single-trait analysis. A common approach is to allocate molecular markers to different classes based on the biological information and to assign separate priors to the markers in each class. It has been shown that such methods can achieve higher prediction accuracy than conventional methods in some circumstances. However, these methods mainly focus on single-trait analysis, and the priors available in these methods are limited. Thus, for both single-trait and multiple-trait analyses, we propose multi-class Bayesian Alphabet methods, in which multiple Bayesian Alphabet priors, including RR-BLUP, BayesA, BayesB, BayesCΠ, and Bayesian LASSO, can be used for markers allocated to different classes. The superior performance of the multi-class Bayesian Alphabet in genomic prediction is demonstrated using both real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.
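The underlying idea of marker classes with class-specific priors can be illustrated, in a very simplified and non-Bayesian form, by a ridge regression with a separate shrinkage parameter per class. The sketch below is only an analogy for the allocation step, not the JWAS Bayesian Alphabet samplers themselves; all names and values are illustrative.

```python
import numpy as np

def class_specific_ridge(X, y, classes, lambdas):
    """Ridge regression of phenotypes on marker genotypes with a separate
    shrinkage parameter per marker class (a crude, non-Bayesian analog of
    assigning class-specific priors)."""
    penalty = np.array([lambdas[c] for c in classes])   # one lambda per marker
    lhs = X.T @ X + np.diag(penalty)
    rhs = X.T @ y
    return np.linalg.solve(lhs, rhs)                    # marker effect estimates

# Toy usage: markers in class 0 (e.g. annotated regions) are shrunk less than class 1
rng = np.random.default_rng(5)
X = rng.integers(0, 3, (300, 1000)).astype(float)       # genotypes coded 0/1/2
true_beta = np.where(rng.random(1000) < 0.02, rng.normal(0, 1, 1000), 0.0)
y = X @ true_beta + rng.normal(0, 1, 300)
classes = (rng.random(1000) < 0.2).astype(int)          # 0 = prioritized markers, 1 = rest
beta_hat = class_specific_ridge(X, y, classes, lambdas={0: 10.0, 1: 200.0})
```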



2009
Vol 21 (7)
pp. 2082-2103
Author(s):
Shirish Shevade
S. Sundararajan

Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Designing a GP classifier and making predictions with it are, however, computationally demanding, especially when the training set is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. Experimental results on several real-world benchmark data sets show generalization performance that is better than or comparable to existing methods.
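The validation criterion described here, the negative log predictive loss, can be written down in a few lines. The sketch below assumes binary labels and predictive class probabilities already obtained from a GP classifier; it is a generic illustration rather than the letter's full basis-selection procedure.

```python
import numpy as np

def negative_log_predictive_loss(p_pred, y_val, eps=1e-12):
    """Average negative log predictive (NLP) loss on a validation set.

    p_pred : (n,) predicted probability of class 1 for each validation point
             (e.g. from a sparse GP classifier's predictive distribution).
    y_val  : (n,) labels in {0, 1}.
    """
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

# Toy usage: compare two candidate models (e.g. different basis vector sets
# or hyperparameters) by their validation NLP loss and keep the smaller one.
y_val = np.array([1, 0, 1, 1, 0])
loss_a = negative_log_predictive_loss(np.array([0.9, 0.2, 0.7, 0.8, 0.1]), y_val)
loss_b = negative_log_predictive_loss(np.array([0.6, 0.4, 0.5, 0.7, 0.3]), y_val)
print(loss_a, loss_b)   # the candidate with lower NLP loss would be preferred
```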


