Multivariate Statistical Machine Learning Methods for Genomic Prediction

Published by Springer International Publishing (ISBN 978-3-030-89009-4, 978-3-030-89010-0)

Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The Bayesian paradigm for parameter estimation is introduced and linked to the central problem of genomic-enabled prediction: predicting the trait of interest for non-phenotyped individuals from genotypic information, environmental variables, or other covariates. In this situation, a convenient practice is to include the individuals to be predicted in the posterior distribution to be sampled. We explain how the Bayesian Ridge regression method is derived and exemplify it with plant breeding data for genomic selection. Other Bayesian methods (Bayes A, Bayes B, Bayes C, and the Bayesian Lasso) are also described and exemplified for genome-based prediction. The chapter presents several examples implemented in the Bayesian generalized linear regression (BGLR) library for continuous response variables. Under all these Bayesian methods, the predictor includes main effects (of environments and genotypes) as well as interaction terms related to genotype × environment interaction.
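As a concrete illustration, the sketch below fits a Bayesian Ridge regression with BGLR on simulated markers; the marker matrix, trait values, and sampler settings are illustrative assumptions, not data from the chapter. Masking the phenotypes of the testing individuals (setting them to NA) is how BGLR includes the individuals to be predicted in the posterior distribution, as described above.

```r
# Minimal sketch: Bayesian Ridge Regression with BGLR on simulated data.
# X, y, and all sample sizes here are simulated for illustration only.
library(BGLR)

set.seed(1)
n <- 200; p <- 500
X <- matrix(rbinom(n * p, 2, 0.3), n, p)        # marker matrix coded 0/1/2
b <- rnorm(p, 0, 0.05)
y <- as.vector(X %*% b + rnorm(n))

# Mask the phenotypes of the individuals to be predicted:
# BGLR samples them as part of the posterior.
yNA <- y
tst <- sample(n, 40)
yNA[tst] <- NA

fm <- BGLR(y = yNA,
           ETA = list(list(X = X, model = "BRR")),  # "BayesA", "BayesB",
           nIter = 5000, burnIn = 1000,             # "BayesC", "BL" also work
           verbose = FALSE)

cor(fm$yHat[tst], y[tst])  # predictive correlation in the testing set
```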


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: We give a detailed description of random forest and exemplify its use with data from plant breeding and genomic selection. The motivations for using random forest in genomic-enabled prediction are explained. We then describe the process of building decision trees, which are the key component of random forest models. We give (1) the random forest algorithm, (2) the main hyperparameters that need to be tuned, and (3) the different splitting rules that are key to implementing random forest models for continuous, binary, categorical, and count response variables. In addition, many examples are provided for training random forest models with different types of response variables using plant breeding data. The random forest algorithm for multivariate outcomes is also given and its most popular splitting rules are explained, with examples illustrating its implementation even with mixed outcomes (continuous, binary, and categorical). Final comments on the pros and cons of random forest close the chapter.
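A minimal sketch of this workflow with the randomForest package, assuming simulated markers and a continuous trait (illustrative only, not the chapter's data):

```r
# Minimal sketch: random forest regression for a continuous trait.
library(randomForest)

set.seed(2)
n <- 150; p <- 300
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

trn <- sample(n, 120); tst <- setdiff(1:n, trn)

# ntree and mtry are among the main hyperparameters mentioned above;
# mtry = p/3 is the usual default for regression trees.
rf <- randomForest(x = X[trn, ], y = y[trn],
                   ntree = 500, mtry = floor(p / 3))

yhat <- predict(rf, X[tst, ])
cor(yhat, y[tst])  # predictive correlation in the testing set
```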


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: In this chapter, we explain, under a Bayesian framework, the fundamentals and practical issues of implementing genomic prediction models for categorical and count traits. First, we derive the Bayesian ordinal model and exemplify it with plant breeding data; these examples are implemented in the BGLR library. We also derive the ordinal logistic regression. The fundamentals and practical issues of penalized multinomial logistic regression and penalized Poisson regression are given, including several examples illustrating the use of the glmnet library. All the examples include main effects of environments and genotypes as well as the genotype × environment interaction term.
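The sketch below illustrates both routes on simulated data: a Bayesian ordinal model in BGLR (via its response_type = "ordinal" option) and a penalized Poisson regression in glmnet. The data, category cut points, and sampler settings are illustrative assumptions.

```r
# Minimal sketch: ordinal regression in BGLR and penalized Poisson
# regression in glmnet, both on simulated data.
library(BGLR); library(glmnet)

set.seed(3)
n <- 200; p <- 300
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
eta <- as.vector(X[, 1:5] %*% rnorm(5))

# Ordinal trait with three categories.
y_ord <- cut(eta + rnorm(n), breaks = 3, labels = FALSE)
fm <- BGLR(y = y_ord, response_type = "ordinal",
           ETA = list(list(X = X, model = "BRR")),
           nIter = 5000, burnIn = 1000, verbose = FALSE)
head(fm$probs)   # estimated probabilities of each category

# Penalized Poisson regression for a count trait (alpha = 1: lasso penalty).
y_cnt <- rpois(n, exp(0.1 * eta))
cvfit <- cv.glmnet(X, y_cnt, family = "poisson", alpha = 1)
head(predict(cvfit, X, s = "lambda.min", type = "response"))
```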


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: In this chapter, support vector machine (SVM) methods are studied. We first point out the origin and popularity of these methods, and then we define the hyperplane concept, which is the key to building them. We derive two methods related to the SVM: the maximum margin classifier and the support vector classifier. We describe the derivation of the SVM along with some kernel functions that are fundamental for building the different kernel methods allowed in the SVM. We explain how the SVM for binary response variables can be extended to categorical response variables and give examples of SVM for binary and categorical response variables with plant breeding data for genomic selection. Finally, general issues in adopting the SVM methodology for continuous response variables are discussed, and some examples of SVM for continuous response variables in genomic prediction are described.
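A minimal sketch with the e1071 package, assuming simulated markers and a binary trait (illustrative only, not the chapter's data):

```r
# Minimal sketch: SVM with a radial (Gaussian) kernel for a binary trait.
library(e1071)

set.seed(4)
n <- 200; p <- 200
X <- matrix(rbinom(n * p, 2, 0.3), n, p)
y <- factor(ifelse(as.vector(X[, 1:5] %*% rnorm(5)) + rnorm(n) > 0, "A", "B"))

trn <- sample(n, 160); tst <- setdiff(1:n, trn)

# cost and gamma are the main hyperparameters to tune; "linear",
# "polynomial", and "sigmoid" are other kernels available in svm().
fit <- svm(x = X[trn, ], y = y[trn], kernel = "radial",
           cost = 1, gamma = 1 / p)

mean(predict(fit, X[tst, ]) == y[tst])  # classification accuracy
```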


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The linear mixed model framework is explained in detail in this chapter. We explore three methods of parameter estimation (maximum likelihood, the EM algorithm, and REML) and illustrate how genomic-enabled predictions are performed under this framework. We illustrate the use of linear mixed models with a predictor containing several components, such as environments, genotypes, and the genotype × environment interaction. The linear mixed model is also illustrated under a multi-trait framework, which is important for prediction performance when the degree of correlation between traits is moderate or large. We illustrate the use of single-trait and multi-trait linear mixed models and provide the R codes for performing the analyses.
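As a sketch of the single-trait case, the code below fits a mixed model with fixed environment effects and random genotype and genotype × environment effects by REML using lme4; the data frame and variance settings are simulated assumptions, and the chapter's own R codes may differ.

```r
# Minimal sketch: single-trait linear mixed model fitted by REML (lme4).
library(lme4)

set.seed(5)
nG <- 30; nE <- 4
dat <- expand.grid(GID = paste0("G", 1:nG),
                   Env = paste0("E", 1:nE), Rep = 1:2)
g  <- rnorm(nG, 0, 2)                 # genotype effects
ge <- matrix(rnorm(nG * nE), nG, nE)  # genotype x environment effects
dat$y <- 10 + as.integer(dat$Env) +
         g[as.integer(dat$GID)] +
         ge[cbind(as.integer(dat$GID), as.integer(dat$Env))] +
         rnorm(nrow(dat))

# Fixed environments, random genotypes, random G x E interaction.
fm <- lmer(y ~ Env + (1 | GID) + (1 | GID:Env), data = dat)
summary(fm)          # REML variance components
head(ranef(fm)$GID)  # BLUPs of the genotype effects
```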


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model that gives rise to best linear unbiased estimates (BLUE or BLUEs) and best linear unbiased predictors (BLUP or BLUPs) and explain how to decide when to treat effects as fixed or random. The R codes for fitting linear mixed models to the data are given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, recoding markers, computing the frequency of heterozygous genotypes and of missing values (NAs), and three methods for computing the genomic relationship matrix. In addition, scaling and data compression of inputs are important in statistical machine learning. For a more extensive description of linear mixed models, see Chap. 5 (https://doi.org/10.1007/978-3-030-89010-0_5).
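The sketch below illustrates two of these steps on simulated marker data: allele frequency computation with a minor allele frequency (MAF) filter, and the genomic relationship matrix by VanRaden's (2008) first method. The coding, thresholds, and data are illustrative assumptions.

```r
# Minimal sketch: allele frequencies, MAF filtering, and VanRaden's
# genomic relationship matrix from a marker matrix coded 0/1/2.
set.seed(6)
n <- 100; p <- 1000
M <- matrix(rbinom(n * p, 2, runif(p, 0.1, 0.9)), n, p, byrow = TRUE)

pfreq <- colMeans(M) / 2          # frequency of the counted allele
maf   <- pmin(pfreq, 1 - pfreq)   # minor allele frequency
keep  <- maf > 0.05               # drop low-MAF markers
M     <- M[, keep]; pfreq <- pfreq[keep]

# VanRaden (2008) method 1: G = WW' / (2 * sum(p * (1 - p))),
# where W holds the column-centered marker codes.
W <- sweep(M, 2, 2 * pfreq)
G <- tcrossprod(W) / (2 * sum(pfreq * (1 - pfreq)))
G[1:4, 1:4]
```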


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: The overfitting phenomenon occurs when a statistical machine learning model learns the noise as well as the signal present in the training data. Underfitting, on the other hand, occurs when the model includes too few predictors and therefore represents the structure of the data pattern poorly; this problem also arises when the training data set is too small, so that an underfitted model does a poor job of fitting the training data and predicts new data points unsatisfactorily. This chapter describes the importance of the trade-off between prediction accuracy and model interpretability, as well as the difference between explanatory and predictive modeling: explanatory modeling minimizes bias, whereas predictive modeling seeks to minimize the combination of bias and estimation variance. We assess the importance of cross-validation and describe its different methods, as well as the importance and strategies of hyperparameter tuning, which are key to the successful use of some statistical machine learning methods. We explain the most important metrics for evaluating prediction performance for continuous, binary, categorical, and count response variables.
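As an illustration of cross-validation combined with tuning, the sketch below runs a 5-fold outer cross-validation around a ridge regression whose penalty is tuned by an inner cross-validation (cv.glmnet), reporting mean squared error and predictive correlation. The data, fold count, and model choice are illustrative assumptions.

```r
# Minimal sketch: 5-fold cross-validation with inner tuning of the
# ridge penalty, reporting MSE and predictive correlation.
library(glmnet)

set.seed(7)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

K <- 5
folds <- sample(rep(1:K, length.out = n))
mse <- cor_k <- numeric(K)
for (k in 1:K) {
  tst <- which(folds == k)
  fit  <- cv.glmnet(X[-tst, ], y[-tst], alpha = 0)  # inner CV tunes lambda
  yhat <- as.vector(predict(fit, X[tst, ], s = "lambda.min"))
  mse[k]   <- mean((y[tst] - yhat)^2)
  cor_k[k] <- cor(y[tst], yhat)
}
c(MSE = mean(mse), COR = mean(cor_k))
```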


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This chapter deals with the main theoretical fundamentals and practical issues of using functional regression in the context of genomic prediction. We explain how to represent data as functions by means of basis functions and consider two types: Fourier bases for periodic or near-periodic data and B-splines for nonperiodic data. We derive the functional regression with a smoothed coefficient function under a fixed-effects model framework, and some examples are provided under this model. A Bayesian version of functional regression is also outlined and explained, and all details for its implementation in glmnet and BGLR are given. The examples include in the predictor the main effects of environments and genotypes and the genotype × environment interaction term. The examples use small data sets so that users can run them on their own computers and understand the implementation process.
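A minimal sketch of the smoothed-coefficient idea under a fixed model, assuming simulated curves (e.g., reflectance measured at 50 points): the functional covariate is projected onto a B-spline basis (splines::bs), reducing the regression to a handful of basis coefficients.

```r
# Minimal sketch: functional regression with a B-spline basis for
# the coefficient function, on simulated curve data.
library(splines)

set.seed(8)
n <- 120; t <- seq(0, 1, length.out = 50)  # 50 evaluation points per curve
Xf <- matrix(rnorm(n * 50), n, 50)         # functional covariate (curves)
beta_t <- sin(2 * pi * t)                  # true coefficient function
y <- as.vector(Xf %*% beta_t) / 50 + rnorm(n, 0, 0.1)

# Project the coefficient function onto a B-spline basis: the predictor
# becomes Z = Xf %*% B / 50, with only ncol(B) coefficients to estimate.
B <- bs(t, df = 8, intercept = TRUE)
Z <- Xf %*% B / 50
fit <- lm(y ~ Z)

beta_hat <- B %*% coef(fit)[-1]            # recovered coefficient function
plot(t, beta_t, type = "l"); lines(t, beta_hat, col = 2)
```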


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: This chapter provides the elements for implementing deep neural networks (deep learning) for continuous outcomes. We give details of the hyperparameters to be tuned in deep neural networks and provide a general guide for carrying out this task with a greater probability of success. We then describe the most popular deep learning frameworks that can be used to implement these models, as well as the most popular optimizers available in many deep learning software programs. Several practical examples of implementing deep neural networks in the Keras library with plant breeding data are outlined. These examples take into account many components in the predictor as well as many hyperparameters (number of hidden layers, number of neurons, learning rate, optimizer, penalization, etc.), and we also illustrate how the tuning process can be done to increase the probability of a successful application.
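The sketch below shows the general shape of such a model in the R keras interface; the architecture, optimizer, learning rate, and dropout rate are illustrative hyperparameter choices, not recommendations from the chapter.

```r
# Minimal sketch: a feedforward network for a continuous trait in R keras.
library(keras)

set.seed(9)
n <- 500; p <- 300
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X[, 1:10] %*% rnorm(10)) + rnorm(n)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = p) %>%
  layer_dropout(rate = 0.2) %>%   # dropout is one penalization option
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)          # linear output for a continuous trait

model %>% compile(optimizer = optimizer_adam(learning_rate = 0.001),
                  loss = "mse")

# validation_split holds out data to monitor overfitting during tuning.
history <- model %>% fit(X, y, epochs = 50, batch_size = 32,
                         validation_split = 0.2, verbose = 0)
```

In practice each of these hyperparameters (units, layers, learning rate, dropout) would be tuned, e.g., by grid search over a validation split, as the chapter describes.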


Author(s): Osval Antonio Montesinos López, Abelardo Montesinos López, Jose Crossa

Abstract: Nowadays, huge quantities of data are collected and analyzed to deliver deep insights into biological processes and human behavior. This chapter assesses the use of big data for prediction and estimation through statistical machine learning and its applications in agriculture and genetics in general, and specifically in genome-based prediction and selection. First, we point out the importance of data and how their use is reshaping our way of living. We also provide the key elements of genomic selection and its potential for plant improvement. In addition, we analyze elements of modeling with machine learning methods applied to genomic selection and stress their importance as a predictive methodology. Two cultures of model building, prediction and inference, are analyzed and discussed: by understanding model building, researchers will be able to select the best model or method for each circumstance. Within this context, we explain the differences between nonparametric models (where the predictors are constructed from information derived from the data) and parametric models (where all the predictors take predetermined forms with respect to the response), as well as their types of effects: fixed, random, and mixed. Basic elements of linear algebra are provided to facilitate understanding of the contents of the book. The chapter also contains examples of the different types of data used in supervised, unsupervised, and semi-supervised learning methods.

