Multivariate Statistical Machine Learning Methods for Genomic Prediction

Published by Springer International Publishing (ISBN 978-3-030-89009-4, 978-3-030-89010-0)

Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The Bayesian paradigm for parameter estimation is introduced and linked to the central problem of genomic-enabled prediction: predicting the trait of interest for non-phenotyped individuals from genotypic information, environmental variables, or other covariates. In this situation, a convenient practice is to include the individuals to be predicted in the posterior distribution to be sampled. We explain how the Bayesian Ridge regression method is derived and exemplify it with plant breeding data for genomic selection. Other Bayesian methods (Bayes A, Bayes B, Bayes C, and the Bayesian Lasso) are also described and exemplified for genome-based prediction. The chapter presents several examples implemented in the Bayesian generalized linear regression (BGLR) library for continuous response variables. Under all these Bayesian methods, the predictor includes main effects (of environments and genotypes) as well as interaction terms related to genotype × environment interaction.
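As a concrete illustration, the sketch below fits a Bayesian Ridge regression with BGLR on simulated markers; the marker matrix, trait values, and sampler settings are illustrative assumptions, not data from the chapter. Masking the phenotypes of the testing individuals (setting them to NA) is how BGLR includes the individuals to be predicted in the posterior distribution, as described above.

```r
# Minimal sketch: Bayesian Ridge Regression with BGLR on simulated data.
# X, y, and all sample sizes here are simulated for illustration only.
library(BGLR)

set.seed(1)
n <- 200; p <- 500
X <- matrix(rbinom(n * p, 2, 0.3), n, p)        # marker matrix coded 0/1/2
b <- rnorm(p, 0, 0.05)
y <- as.vector(X %*% b + rnorm(n))

# Mask the phenotypes of the individuals to be predicted:
# BGLR samples them as part of the posterior.
yNA <- y
tst <- sample(n, 40)
yNA[tst] <- NA

fm <- BGLR(y = yNA,
           ETA = list(list(X = X, model = "BRR")),  # "BayesA", "BayesB",
           nIter = 5000, burnIn = 1000,             # "BayesC", "BL" also work
           verbose = FALSE)

cor(fm$yHat[tst], y[tst])  # predictive correlation in the testing set
```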


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: We give a detailed description of random forest and exemplify its use with data from plant breeding and genomic selection. The motivations for using random forest in genomic-enabled prediction are explained. We then describe the process of building decision trees, which are the key component of random forest models. We give (1) the random forest algorithm, (2) the main hyperparameters that need to be tuned, and (3) the different splitting rules that are key to implementing random forest models for continuous, binary, categorical, and count response variables. In addition, many examples are provided for training random forest models with different types of response variables using plant breeding data. The random forest algorithm for multivariate outcomes is also given and its most popular splitting rules are explained, with examples illustrating its implementation even with mixed outcomes (continuous, binary, and categorical). Final comments on the pros and cons of random forest close the chapter.
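A minimal sketch of this workflow with the randomForest package, assuming simulated markers and a continuous trait (illustrative only, not the chapter's data):

```r
# Minimal sketch: random forest regression for a continuous trait.
library(randomForest)

set.seed(2)
n <- 150; p <- 300
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

trn <- sample(n, 120); tst <- setdiff(1:n, trn)

# ntree and mtry are among the main hyperparameters mentioned above;
# mtry = p/3 is the usual default for regression trees.
rf <- randomForest(x = X[trn, ], y = y[trn],
                   ntree = 500, mtry = floor(p / 3))

yhat <- predict(rf, X[tst, ])
cor(yhat, y[tst])  # predictive correlation in the testing set
```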


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: In this chapter, we explain, under a Bayesian framework, the fundamentals and practical issues of implementing genomic prediction models for categorical and count traits. First, we derive the Bayesian ordinal model and exemplify it with plant breeding data; these examples are implemented in the BGLR library. We also derive the ordinal logistic regression. The fundamentals and practical issues of penalized multinomial logistic regression and penalized Poisson regression are given, including several examples illustrating the use of the glmnet library. All the examples include main effects of environments and genotypes as well as the genotype × environment interaction term.
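The sketch below illustrates both routes on simulated data: a Bayesian ordinal model in BGLR (via its response_type = "ordinal" option) and a penalized Poisson regression in glmnet. The data, category cut points, and sampler settings are illustrative assumptions.

```r
# Minimal sketch: ordinal regression in BGLR and penalized Poisson
# regression in glmnet, both on simulated data.
library(BGLR); library(glmnet)

set.seed(3)
n <- 200; p <- 300
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
eta <- as.vector(X[, 1:5] %*% rnorm(5))

# Ordinal trait with three categories.
y_ord <- cut(eta + rnorm(n), breaks = 3, labels = FALSE)
fm <- BGLR(y = y_ord, response_type = "ordinal",
           ETA = list(list(X = X, model = "BRR")),
           nIter = 5000, burnIn = 1000, verbose = FALSE)
head(fm$probs)   # estimated probabilities of each category

# Penalized Poisson regression for a count trait (alpha = 1: lasso penalty).
y_cnt <- rpois(n, exp(0.1 * eta))
cvfit <- cv.glmnet(X, y_cnt, family = "poisson", alpha = 1)
head(predict(cvfit, X, s = "lambda.min", type = "response"))
```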


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: In this chapter, support vector machine (SVM) methods are studied. We first point out the origin and popularity of these methods, and then we define the hyperplane concept, which is the key to building them. We derive two methods related to the SVM: the maximum margin classifier and the support vector classifier. We describe the derivation of the SVM along with some kernel functions that are fundamental for building the different kernel methods allowed in the SVM. We explain how the SVM for binary response variables can be extended to categorical response variables and give examples of SVM for binary and categorical response variables with plant breeding data for genomic selection. Finally, general issues in adopting the SVM methodology for continuous response variables are discussed, and some examples of SVM for continuous response variables in genomic prediction are described.
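A minimal sketch with the e1071 package, assuming simulated markers and a binary trait (illustrative only, not the chapter's data):

```r
# Minimal sketch: SVM with a radial (Gaussian) kernel for a binary trait.
library(e1071)

set.seed(4)
n <- 200; p <- 200
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
y <- factor(ifelse(as.vector(X[, 1:5] %*% rnorm(5)) + rnorm(n) > 0, "A", "B"))

trn <- sample(n, 160); tst <- setdiff(1:n, trn)

# cost and gamma are the main hyperparameters to tune; "linear",
# "polynomial", and "sigmoid" are other kernels available in svm().
fit <- svm(x = X[trn, ], y = y[trn], kernel = "radial",
           cost = 1, gamma = 1 / p)

mean(predict(fit, X[tst, ]) == y[tst])  # classification accuracy
```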


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The linear mixed model framework is explained in detail in this chapter. We explore three methods of parameter estimation (maximum likelihood, the EM algorithm, and REML) and illustrate how genomic-enabled predictions are performed under this framework. We illustrate the use of linear mixed models with a predictor containing several components, such as environments, genotypes, and the genotype × environment interaction. The linear mixed model is also illustrated under a multi-trait framework, which is important for prediction performance when the degree of correlation between traits is moderate or large. We illustrate the use of single-trait and multi-trait linear mixed models and provide the R codes for performing the analyses.
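As a sketch of the single-trait case, the code below fits a mixed model with fixed environment effects and random genotype and genotype × environment effects by REML using lme4; the data frame and variance settings are simulated assumptions, and the chapter's own R codes may differ.

```r
# Minimal sketch: single-trait linear mixed model fitted by REML (lme4).
library(lme4)

set.seed(5)
nG <- 30; nE <- 4
dat <- expand.grid(GID = paste0("G", 1:nG),
                   Env = paste0("E", 1:nE), Rep = 1:2)
g  <- rnorm(nG, 0, 2)                 # genotype effects
ge <- matrix(rnorm(nG * nE), nG, nE)  # genotype x environment effects
dat$y <- 10 + as.integer(dat$Env) +
         g[as.integer(dat$GID)] +
         ge[cbind(as.integer(dat$GID), as.integer(dat$Env))] +
         rnorm(nrow(dat))

# Fixed environments, random genotypes, random G x E interaction.
fm <- lmer(y ~ Env + (1 | GID) + (1 | GID:Env), data = dat)
summary(fm)          # REML variance components
head(ranef(fm)$GID)  # BLUPs of the genotype effects
```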


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model that gives rise to best linear unbiased estimates (BLUE or BLUEs) and best linear unbiased predictors (BLUP or BLUPs) and explain how to decide when to treat effects as fixed or random. The R codes for fitting linear mixed models to the data are given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, recoding markers, computing the frequency of heterozygous genotypes and of missing values (NAs), and three methods for computing the genomic relationship matrix. In addition, scaling and data compression of inputs are important in statistical machine learning. For a more extensive description of linear mixed models, see Chap. 5 (https://doi.org/10.1007/978-3-030-89010-0_5).
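The sketch below illustrates two of these steps on simulated marker data: allele frequency computation with a minor allele frequency (MAF) filter, and the genomic relationship matrix by VanRaden's (2008) first method. The coding, thresholds, and data are illustrative assumptions.

```r
# Minimal sketch: allele frequencies, MAF filtering, and VanRaden's
# genomic relationship matrix from a marker matrix coded 0/1/2.
set.seed(6)
n <- 100; p <- 1000
M <- matrix(rbinom(n * p, 2, runif(p, 0.1, 0.9)), n, p, byrow = TRUE)

pfreq <- colMeans(M) / 2          # frequency of the counted allele
maf   <- pmin(pfreq, 1 - pfreq)   # minor allele frequency
keep  <- maf > 0.05               # drop low-MAF markers
M     <- M[, keep]; pfreq <- pfreq[keep]

# VanRaden (2008) method 1: G = WW' / (2 * sum(p * (1 - p))),
# where W holds the column-centered marker codes.
W <- sweep(M, 2, 2 * pfreq)
G <- tcrossprod(W) / (2 * sum(pfreq * (1 - pfreq)))
G[1:4, 1:4]
```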


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The overfitting phenomenon occurs when a statistical machine learning model learns the noise as well as the signal present in the training data. Underfitting, on the other hand, occurs when the model includes too few predictors and therefore represents the structure of the data pattern poorly; this problem also arises when the training data set is too small, so that an underfitted model does a poor job of fitting the training data and predicts new data points unsatisfactorily. This chapter describes the importance of the trade-off between prediction accuracy and model interpretability, as well as the difference between explanatory and predictive modeling: explanatory modeling minimizes bias, whereas predictive modeling seeks to minimize the combination of bias and estimation variance. We assess the importance of cross-validation and describe its different methods, as well as the importance and strategies of hyperparameter tuning, which are key to the successful use of some statistical machine learning methods. We explain the most important metrics for evaluating prediction performance for continuous, binary, categorical, and count response variables.
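As an illustration of cross-validation combined with tuning, the sketch below runs a 5-fold outer cross-validation around a ridge regression whose penalty is tuned by an inner cross-validation (cv.glmnet), reporting mean squared error and predictive correlation. The data, fold count, and model choice are illustrative assumptions.

```r
# Minimal sketch: 5-fold cross-validation with inner tuning of the
# ridge penalty, reporting MSE and predictive correlation.
library(glmnet)

set.seed(7)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

K <- 5
folds <- sample(rep(1:K, length.out = n))
mse <- cor_k <- numeric(K)
for (k in 1:K) {
  tst <- which(folds == k)
  fit  <- cv.glmnet(X[-tst, ], y[-tst], alpha = 0)  # inner CV tunes lambda
  yhat <- as.vector(predict(fit, X[tst, ], s = "lambda.min"))
  mse[k]   <- mean((y[tst] - yhat)^2)
  cor_k[k] <- cor(y[tst], yhat)
}
c(MSE = mean(mse), COR = mean(cor_k))
```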


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This chapter deals with the main theoretical fundamentals and practical issues of using functional regression in the context of genomic prediction. We explain how to represent data as functions by means of basis functions and consider two types: Fourier bases for periodic or near-periodic data and B-splines for nonperiodic data. We derive the functional regression with a smoothed coefficient function under a fixed-effects model framework, and some examples are provided under this model. A Bayesian version of functional regression is also outlined and explained, and all details for its implementation in glmnet and BGLR are given. The examples include in the predictor the main effects of environments and genotypes and the genotype × environment interaction term. The examples use small data sets so that users can run them on their own computers and understand the implementation process.
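A minimal sketch of the smoothed-coefficient idea under a fixed model, assuming simulated curves (e.g., reflectance measured at 50 points): the functional covariate is projected onto a B-spline basis (splines::bs), reducing the regression to a handful of basis coefficients.

```r
# Minimal sketch: functional regression with a B-spline basis for
# the coefficient function, on simulated curve data.
library(splines)

set.seed(8)
n <- 120; t <- seq(0, 1, length.out = 50)  # 50 evaluation points per curve
Xf <- matrix(rnorm(n * 50), n, 50)         # functional covariate (curves)
beta_t <- sin(2 * pi * t)                  # true coefficient function
y <- as.vector(Xf %*% beta_t) / 50 + rnorm(n, 0, 0.1)

# Project the coefficient function onto a B-spline basis: the predictor
# becomes Z = Xf %*% B / 50, with only ncol(B) coefficients to estimate.
B <- bs(t, df = 8, intercept = TRUE)
Z <- Xf %*% B / 50
fit <- lm(y ~ Z)

beta_hat <- B %*% coef(fit)[-1]            # recovered coefficient function
plot(t, beta_t, type = "l"); lines(t, beta_hat, col = 2)
```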


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This chapter provides the elements for implementing deep neural networks (deep learning) for continuous outcomes. We give details of the hyperparameters to be tuned in deep neural networks and provide a general guide for carrying out this task with a greater probability of success. We then describe the most popular deep learning frameworks that can be used to implement these models, as well as the most popular optimizers available in many deep learning software programs. Several practical examples of implementing deep neural networks in the Keras library with plant breeding data are outlined. These examples take into account many components in the predictor as well as many hyperparameters (number of hidden layers, number of neurons, learning rate, optimizer, penalization, etc.), and we also illustrate how the tuning process can be done to increase the probability of a successful application.
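The sketch below shows the general shape of such a model in the R keras interface; the architecture, optimizer, learning rate, and dropout rate are illustrative hyperparameter choices, not recommendations from the chapter.

```r
# Minimal sketch: a feedforward network for a continuous trait in R keras.
library(keras)

set.seed(9)
n <- 500; p <- 300
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = p) %>%
  layer_dropout(rate = 0.2) %>%   # dropout is one penalization option
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)          # linear output for a continuous trait

model %>% compile(optimizer = optimizer_adam(learning_rate = 0.001),
                  loss = "mse")

# validation_split holds out data to monitor overfitting during tuning.
history <- model %>% fit(X, y, epochs = 50, batch_size = 32,
                         validation_split = 0.2, verbose = 0)
```

In practice each of these hyperparameters (units, layers, learning rate, dropout) would be tuned, e.g., by grid search over a validation split, as the chapter describes.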


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: Nowadays, huge quantities of data are collected and analyzed to deliver deep insights into biological processes and human behavior. This chapter assesses the use of big data for prediction and estimation through statistical machine learning and its applications in agriculture and genetics in general, and specifically in genome-based prediction and selection. First, we point out the importance of data and how their use is reshaping our way of living. We also provide the key elements of genomic selection and its potential for plant improvement. In addition, we analyze elements of modeling with machine learning methods applied to genomic selection and stress their importance as a predictive methodology. Two cultures of model building, prediction and inference, are analyzed and discussed: by understanding model building, researchers will be able to select the best model or method for each circumstance. Within this context, we explain the differences between nonparametric models (where the predictors are constructed from information derived from the data) and parametric models (where all the predictors take predetermined forms with respect to the response), as well as their types of effects: fixed, random, and mixed. Basic elements of linear algebra are provided to facilitate understanding of the contents of the book. The chapter also contains examples of the different types of data used in supervised, unsupervised, and semi-supervised learning methods.

