Optimal breeding-value prediction using a Sparse Selection Index

Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract: Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with the sparsity-inducing techniques commonly used in high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as the special case obtained when λ = 0. In this study, we present the methodology and demonstrate, using two wheat data sets with phenotypes collected in ten different environments, that the SSI can achieve significant gains in prediction accuracy (between 5 and 10%) relative to G-BLUP.
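The λ = 0 endpoint of the index, where the SSI reduces to G-BLUP, can be illustrated with a minimal numpy sketch. This is an assumption-laden toy (simulated markers, an arbitrary variance ratio, hypothetical function names), not the authors' implementation; the comment notes where the SSI's L1 penalty would enter.

```python
import numpy as np

def gblup_weights(G_tt, G_pt, ratio):
    """Per-individual index weights: rows of G_pt (G_tt + ratio*I)^-1,
    where ratio = sigma_e^2 / sigma_u^2. In the SSI these dense weights
    are replaced by L1-penalized (sparse) ones; at lambda = 0 the two
    coincide, which is the G-BLUP special case."""
    n = G_tt.shape[0]
    return G_pt @ np.linalg.inv(G_tt + ratio * np.eye(n))

def gblup_predict(G_tt, G_pt, y, ratio):
    """Predicted genetic values for the prediction set."""
    return y.mean() + gblup_weights(G_tt, G_pt, ratio) @ (y - y.mean())

# toy example: genomic relationship matrix G from simulated marker dosages
rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(30, 200)).astype(float)  # 30 individuals, 200 SNPs
M -= M.mean(axis=0)
G = M @ M.T / M.shape[1]
trn, prd = np.arange(25), np.arange(25, 30)           # training / prediction sets
y = rng.normal(size=25)
u_hat = gblup_predict(G[np.ix_(trn, trn)], G[np.ix_(prd, trn)], y, ratio=1.0)
```

Every prediction individual gets its own weight row over the training records; sparsifying those rows is what selects an individual-specific training subset.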

Heredity ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
Jose Crossa ◽  
Paulino Pérez-Rodríguez ◽  
...  

Abstract: Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulate, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that, when predicting grain yield, the KBLUP outperformed the GBLUP, and that using the SSI with additive relationships (GSSI) led to 5–17% increases in accuracy relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.


2019 ◽  
Author(s):  
Daniel Runcie ◽  
Hao Cheng

Abstract: Incorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting, which may result in a loss of prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model, and can effectively choose among disparate models based on their expected performance in real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and lead to sub-optimal choices between single- and multi-trait models when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: (1) a parametric solution from selection index theory; (2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy; and (3) a fully non-parametric method, which we call CV2*, that validates model predictions against focal-trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine if and when combining information across multiple traits is useful.
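The naive strategy the authors warn about corresponds to keeping the test individuals' secondary-trait records observed during validation. A minimal sketch of the two masking patterns, using the CV1/CV2 scheme names common in the genomic-prediction literature (the function and fold layout are hypothetical):

```python
import numpy as np

def cv_masks(n_ind, n_traits, fold_ids, fold, scheme="CV1"):
    """Boolean mask of observed phenotype records for one CV fold.
    CV1: every trait of a test individual is masked (no records leak).
    CV2: only the focal trait (column 0) of test individuals is masked;
    their secondary-trait records stay observed -- the setting in which
    a naive accuracy estimate can be biased."""
    observed = np.ones((n_ind, n_traits), dtype=bool)
    test = fold_ids == fold
    if scheme == "CV1":
        observed[test, :] = False
    else:  # "CV2"
        observed[test, 0] = False
    return observed
```

Masked entries are what the multi-trait model must predict; model selection that validates against them under the CV2 pattern is the biased comparison the simulations quantify.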


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Hyungsik Shin ◽  
Jeongyeup Paek

Automatic task classification is a core part of personal assistant systems that are widely used in mobile devices such as smartphones and tablets. Even though many industry leaders provide their own personal assistant services, their proprietary internals and implementations are not well known to the public. In this work, we show through real implementation and evaluation that automatic task classification can be implemented for mobile devices by using the support vector machine algorithm and crowdsourcing. To train our task classifier, we collected our training data set via crowdsourcing on the Amazon Mechanical Turk platform. Our classifier can classify a short English sentence into one of thirty-two predefined tasks that are frequently requested while using personal mobile devices. Evaluation results show that our classifier achieves high prediction accuracy, ranging from 82% to 99%. By using a large amount of crowdsourced data, we also illustrate the relationship between training data size and the prediction accuracy of our task classifier.
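The paper's classifier is an SVM trained on crowdsourced sentences; the sketch below substitutes a nearest-centroid bag-of-words model on a made-up two-task example, just to show the shape of the pipeline (all sentences, task labels, and names are hypothetical, and the real system distinguishes thirty-two tasks):

```python
import numpy as np

def bow(texts, vocab):
    """Bag-of-words count matrix over a fixed vocabulary."""
    X = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for w in text.lower().split():
            if w in vocab:
                X[i, vocab[w]] += 1
    return X

# made-up two-task training set standing in for the crowdsourced corpus
train = [("set an alarm for seven", "alarm"), ("wake me up at six", "alarm"),
         ("what is the weather today", "weather"), ("will it rain tomorrow", "weather")]
vocab = {w: i for i, w in enumerate(sorted({w for t, _ in train for w in t.split()}))}
X = bow([t for t, _ in train], vocab)
labels = sorted({c for _, c in train})
centroids = np.array([X[[c == lab for _, c in train]].mean(axis=0) for lab in labels])

def classify(sentence):
    """Assign the task whose training centroid is nearest in BOW space."""
    v = bow([sentence], vocab)[0]
    return labels[int(np.argmin(((centroids - v) ** 2).sum(axis=1)))]
```

Swapping the centroid rule for a trained SVM, and growing the labeled set, is where the reported accuracy-vs-data-size relationship comes from.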


2013 ◽  
Vol 380-384 ◽  
pp. 1673-1676
Author(s):  
Juan Du

To capture the cumulative effect of time in time-series prediction, a process neural network is adopted. A modified particle swarm training algorithm is applied to the model to improve its learning speed. The training data are sunspot numbers from 1700 to 2007. Simulation results show that the proposed prediction model and algorithm achieve faster training and higher prediction accuracy than a conventional artificial neural network.
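The modification to the particle swarm algorithm is not specified here; below is a generic (unmodified) PSO loop, sketched on a toy quadratic objective rather than on the process-network weights, to show the update rule involved. The inertia and acceleration constants are illustrative assumptions.

```python
import numpy as np

def pso(f, dim, n_particles=20, iters=100, seed=0):
    """Minimize f over R^dim with a basic particle swarm:
    velocity = inertia + pull toward personal best + pull toward global best."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))     # positions
    v = np.zeros_like(x)                           # velocities
    pbest, pval = x.copy(), np.array([f(p) for p in x])
    g = pbest[pval.argmin()].copy()                # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        better = vals < pval
        pbest[better], pval[better] = x[better], vals[better]
        g = pbest[pval.argmin()].copy()
    return g, float(pval.min())

best_x, best_val = pso(lambda p: float((p ** 2).sum()), dim=2)
```

In the paper's setting, `f` would be the network's training error as a function of its weight vector, so no gradient is required.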


2015 ◽  
Author(s):  
Abelardo Montesinos-Lopez ◽  
Osval Montesinos-Lopez ◽  
Jose Crossa ◽  
Juan Burgueno ◽  
Kent Eskridge ◽  
...  

Genomic tools allow the study of the whole genome and are facilitating the analysis of genotype-environment combinations and their relationship with the phenotype. However, most genomic prediction models developed so far are appropriate only for Gaussian phenotypes. Appropriate genomic prediction models are therefore needed for count data, since the conventional regression models used on count data with a large sample size (n) and a small number of parameters (p) cannot be used for genomic-enabled prediction, where the number of parameters (p) is larger than the sample size (n). Here we propose a Bayesian mixed negative binomial (BMNB) genomic regression model for counts that takes into account genotype by environment (G × E) interaction. We also provide all the full conditional distributions needed to implement a Gibbs sampler. We evaluated the proposed model using a simulated data set and a real wheat data set from the International Maize and Wheat Improvement Center (CIMMYT) and collaborators. Results indicate that our BMNB model is a viable alternative for analyzing count data.
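The negative binomial likelihood at the core of such a model can be written down directly; the mean/dispersion parameterization below (variance μ + μ²/r) is one common choice and an assumption, not necessarily the authors' exact form:

```python
from math import lgamma, log, exp

def nb_logpmf(y, mu, r):
    """Negative binomial log-pmf with mean mu and dispersion r,
    so Var(Y) = mu + mu^2 / r; overdispersion relative to Poisson
    is what makes this family suitable for count phenotypes."""
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(r / (r + mu)) + y * log(mu / (r + mu)))
```

In a genomic regression, μ would be tied to markers and G × E terms through a log link; the Gibbs sampler then draws each effect from its full conditional.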


2014 ◽  
Vol 7 (4) ◽  
pp. 132-143
Author(s):  
ABBAS M. ABD ◽  
SAAD SH. SAMMEN

The prediction of different hydrological phenomena (or systems) plays an increasing role in the management of water resources. Engineers are required to predict the components of natural reservoirs' inflow for numerous purposes. The resulting prediction techniques vary with the intended purpose, the catchment characteristics, and the documented data. Because most hydrological parameters are subject to uncertainty, identifying the best prediction method is of interest to experts. An Artificial Neural Network (ANN) approach has been adopted in this paper to predict Hemren reservoir inflow. The available data, including the monthly discharge supplied from the DerbendiKhan reservoir and the rainfall intensity falling on the intermediate catchment area between the Hemren and DerbendiKhan dams, were used. A Levenberg-Marquardt back-propagation (LMBP) algorithm was utilized to construct the ANN models. For the developed ANN model, networks with different numbers of neurons and layers were evaluated. A total of 24 years of historical data, covering the interval from 1980 to 2004, were used to train and test the networks. The optimum ANN network, with 3 inputs, 40 neurons in each of two hidden layers, and one output, was selected. The Mean Squared Error (MSE) and the Correlation Coefficient (CC) were employed to evaluate the accuracy of the proposed model. The network was trained, using an early stopping approach on the training data, and converged at MSE = 0.027; it could forecast the testing data set with an accuracy of MSE = 0.031. Training and testing showed correlation coefficients of 0.97 and 0.77, respectively, indicating the high precision of the prediction technique.
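The two reported criteria are standard and can be computed as follows (a minimal numpy sketch):

```python
import numpy as np

def mse(y, yhat):
    """Mean Squared Error between observed and predicted series."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean((y - yhat) ** 2))

def corr(y, yhat):
    """Pearson correlation coefficient (the CC reported here)."""
    return float(np.corrcoef(y, yhat)[0, 1])
```

The gap between the training CC (0.97) and the testing CC (0.77) is the usual signal to check for overfitting, which is what the early stopping approach is meant to control.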


2021 ◽  
Vol 73 (11) ◽  
pp. 65-66
Author(s):  
Chris Carpenter

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 203962, “Upscaling of Realistic Discrete Fracture Simulations Using Machine Learning,” by Nikolai Andrianov, SPE, Geological Survey of Denmark and Greenland, prepared for the 2021 SPE Reservoir Simulation Conference, Galveston, Texas, 4–6 October. The paper has not been peer reviewed. Upscaling of discrete fracture networks to continuum models such as the dual-porosity/dual-permeability (DP/DP) model is an industry-standard approach to modeling fractured reservoirs. In the complete paper, the author parametrizes the fine-scale fracture geometries and assesses the accuracy of several convolutional neural networks (CNNs) in learning the mapping between this parametrization and the DP/DP model closures. The accuracy of the DP/DP results with the predicted model closures was assessed by comparison with the corresponding fine-scale discrete fracture matrix (DFM) simulation of two-phase flow in a realistic fracture geometry. The DP/DP results matched the DFM reference solution well, and the DP/DP model was also significantly faster than the DFM simulation.

Introduction: The goal of this study was to evaluate the effect of different CNN architectures on prediction accuracy for the DP/DP model closures and on the accuracy of DP/DP simulations in comparison with fine-scale DFM simulations. As a starting point, two CNN configurations were considered that have achieved breakthrough performance in image-classification tasks. The author adapted these architectures to the problem of learning the mapping between pixelated fracture geometries and the DP/DP model closures and identified several key features of the CNN structure that are crucial for achieving high prediction accuracy. Mapping fracture geometries requires significant effort, which limits the possibilities for creating large training data sets with realistic fracture geometries.
The author therefore used the synthetic random linear fractures data set to train the CNNs and the fracture geometry from the Lägerdorf outcrop for testing. An optimal CNN configuration was demonstrated to yield DP/DP model closures such that the corresponding DP/DP results matched the two-phase DFM simulations well on a subset of the Lägerdorf data. The run times for the DP/DP model were a fraction of the time needed for the DFM simulations. The problem formulation is presented in a series of equations in the complete paper.


2020 ◽  
Vol 10 (8) ◽  
pp. 2725-2739 ◽  
Author(s):  
Diego Jarquin ◽  
Reka Howard ◽  
Jose Crossa ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
...  

“Sparse testing” refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, overall costs can be reduced and testing capacities increased. The accuracy of predicting the unobserved data depends on different factors, including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments and (2) complete overlap of the genotypes between environments, in which case the prediction set fully consists of genotypes that have not been tested at all. Moreover, we gradually move from one extreme to the other by considering (3) intermediate designs with varying numbers of non-overlapping (NO) and overlapping (O) genotypes. The empirical study is built upon two maize hybrid data sets consisting of different genotypes crossed to two different testers (T1 and T2); each data set was analyzed separately. For each set, phenotypic records on yield from three different environments are available. Three prediction models were implemented: two main-effects models (M1 and M2) and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models that did not include this component. M3 also provided higher prediction accuracy than models M1 and M2 across the different allocation scenarios. Reducing the size of the calibration sets decreased prediction accuracy under all allocation designs, with M3 being the least affected model; however, with the genome-enabled models (i.e., M2 and M3), predictive ability is recovered when more genotypes are tested across environments. Our results indicate that a substantial part of the testing resources can be saved when genome-based models including GE are used to optimize sparse testing designs.
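The allocation designs described, from no overlap to complete overlap, can be sketched as a genotype-by-environment testing plan. The function below is a hypothetical illustration (even split of the non-overlapping genotypes), not the allocation algorithm used in the study:

```python
import numpy as np

def allocate(n_gen, n_env, n_overlap, seed=0):
    """Testing plan: rows = genotypes, columns = environments, True = tested.
    n_overlap genotypes are tested in every environment (the O set);
    the remaining genotypes are split evenly, each tested in exactly
    one environment (the NO set). n_overlap = 0 and n_overlap = n_gen
    give the two extreme designs."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(n_gen)
    plan = np.zeros((n_gen, n_env), dtype=bool)
    plan[ids[:n_overlap], :] = True
    rest = ids[n_overlap:]
    for j, chunk in enumerate(np.array_split(rest, n_env)):
        plan[chunk, j] = True
    return plan
```

Under a fixed plot budget, increasing `n_overlap` trades breadth (more distinct genotypes) for connectivity between environments, which is exactly the trade-off the GE model (M3) exploits.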


2021 ◽  
Vol 12 ◽  
Author(s):  
Stefan Wilson ◽  
Chaozhi Zheng ◽  
Chris Maliepaard ◽  
Han A. Mulder ◽  
Richard G. F. Visser ◽  
...  

Use of genomic prediction (GP) in tetraploid crops is becoming more common, so we think it is the right time for a comparison of GP models for tetraploid potato. We compared GP models that contrasted shrinkage with variable selection, parametric vs. non-parametric approaches, and different ways of accounting for non-additive genetic effects. As a complement to GP, association studies were carried out in an attempt to understand the differences in prediction accuracy. We compared our GP models on a data set of 147 cultivars, representing worldwide diversity, with over 39k GBS markers and measurements on four tuber traits collected in six trials at three locations over 2 years. GP accuracies ranged from 0.32 for tuber count to 0.77 for dry matter content. For all traits, differences between GP models that utilised shrinkage penalties and those that performed variable selection were negligible. This was surprising for dry matter, as only a few additive markers explained over 50% of the phenotypic variation. Accuracy for tuber count increased from 0.35 to 0.41 when dominance was included in the model. This result is supported by a Genome-Wide Association Study (GWAS), which found that additive and dominance effects accounted for 37% of the phenotypic variation, while significant additive effects alone accounted for 14%. For tuber weight, the Reproducing Kernel Hilbert Space (RKHS) model gave a larger improvement in prediction accuracy than explicitly modelling epistatic effects, an indication that the between-locus epistatic effects on tuber weight can be captured more effectively by the semi-parametric RKHS model. Our results show good opportunities for GP in tetraploid (4x) potato.
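The RKHS model credited for tuber weight amounts to kernel ridge regression with a Gaussian kernel. The sketch below is a minimal stand-in; the bandwidth scaling and the regularization value are illustrative assumptions, not the study's settings:

```python
import numpy as np

def gaussian_kernel(X, h=1.0):
    """Gaussian kernel on rows of X; squared distances are scaled by
    their mean so the bandwidth h is unitless (a common heuristic)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-h * d2 / d2.mean())

def kernel_predict(K, y_obs, obs, prd, lam=0.1):
    """Kernel ridge (RKHS regression): solve on the observed block,
    then project onto the prediction rows of the kernel."""
    alpha = np.linalg.solve(K[np.ix_(obs, obs)] + lam * np.eye(len(obs)),
                            y_obs - y_obs.mean())
    return y_obs.mean() + K[np.ix_(prd, obs)] @ alpha

# toy data: 20 cultivars with 50 simulated marker dosages, 15 phenotyped
rng = np.random.default_rng(2)
X = rng.random((20, 50))
y = rng.normal(size=15)
K = gaussian_kernel(X)
u = kernel_predict(K, y, np.arange(15), np.arange(15, 20))
```

Because the Gaussian kernel is a nonlinear function of marker distances, it implicitly captures interactions between loci without the interaction terms being modelled explicitly.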


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1995
Author(s):  
Pingakshya Goswami ◽  
Dinesh Bhatia

Design closure in general VLSI physical design flows and FPGA physical design flows is an important and time-consuming problem. Routing itself can consume as much as 70% of the total design time. Accurate congestion estimation during the early stages of the design flow can help alleviate last-minute routing-related surprises. This paper describes a methodology for a post-placement, machine learning-based routing congestion prediction model for FPGAs. Routing congestion is modeled as a regression problem. We describe our methods for training data generation, feature extraction, training, regression models, validation, and deployment. We tested our prediction model using the ISPD 2016 FPGA benchmarks. Our prediction method reports a very accurate localized congestion value in each channel around a configurable logic block (CLB); the localized congestion is predicted in both the vertical and horizontal directions. We demonstrate the effectiveness of our model on completely unseen designs that were not part of the training data set. The generated results show significant improvement in accuracy, measured as mean absolute error, and in prediction time when compared against the latest state-of-the-art works.

