Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species

Deep learning (DL) has emerged as a powerful tool for making accurate predictions from complex data such as images, text, or video. However, its ability to predict phenotypic values from molecular data is less well studied. Here, we describe the theoretical foundations of DL and provide generic code that can be easily modified to suit specific needs. DL comprises a wide variety of algorithms that depend on numerous hyperparameters. Careful optimization of hyperparameter values is critical to avoid overfitting. Among the DL architectures tested so far in genomic prediction, convolutional neural networks (CNNs) seem more promising than multilayer perceptrons (MLPs). A limitation of DL is the difficulty of interpreting its results. This may not be relevant for genomic prediction in plant or animal breeding but can be critical when assessing the genetic risk of a disease. Although DL technologies are not "plug-and-play", they are easily implemented using the publicly available Keras and TensorFlow software. To illustrate the principles described here, we provide Keras-based code on GitHub.
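As a rough illustration of what an MLP computes in this setting, the sketch below runs a forward pass of a one-hidden-layer network on a single genotype and evaluates an L2 weight penalty, one of the hyperparameter-controlled regularizers used against overfitting. It is not the authors' Keras code: the function names, the tetraploid allele-dosage coding (0 to 4), and all weights are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def relu(z):
    """Rectified linear unit activation."""
    return z if z > 0.0 else 0.0


def dense(inputs, weights, biases, activation):
    """Fully connected layer: weights[j] holds the weights of output neuron j."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def mlp_predict(genotype, hidden_w, hidden_b, out_w, out_b):
    """One hidden ReLU layer followed by a linear output (a predicted phenotype)."""
    hidden = dense(genotype, hidden_w, hidden_b, relu)
    return dense(hidden, [out_w], [out_b], lambda z: z)[0]


def l2_penalty(weight_matrices, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for matrix in weight_matrices for row in matrix for w in row)


# Hypothetical example: four markers coded as allele dosage in a tetraploid (0..4).
genotype = [0, 1, 2, 4]
hidden_w = [[0.5, -1.0, 0.25, 0.0], [-0.5, 0.5, 0.0, 0.25]]
hidden_b = [0.0, 0.5]
out_w = [2.0, -1.0]
out_b = 0.1

prediction = mlp_predict(genotype, hidden_w, hidden_b, out_w, out_b)
penalty = l2_penalty([hidden_w, [out_w]], lam=0.1)
```

In training, the penalty would be added to the prediction loss, so a larger `lam` pushes the weights toward zero. The number of hidden neurons and `lam` are examples of the hyperparameters the abstract says must be tuned carefully.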

A review of deep learning applications for genomic selection

BMC Genomics ◽

10.1186/s12864-020-07319-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Osval Antonio Montesinos-López ◽

Abelardo Montesinos-López ◽

Paulino Pérez-Rodríguez ◽

José Alberto Barrón-López ◽

Johannes W. R. Martini ◽

...

Keyword(s):

Deep Learning ◽

Plant Breeding ◽

Genomic Selection ◽

Genomic Prediction ◽

Mixed Model ◽

Prediction Models ◽

Genetic Effect ◽

Training Data ◽

Additive Genetic Effect ◽

Main Body

Background: Several conventional genomic prediction methods, Bayesian and non-Bayesian, have been proposed, including the standard additive genetic effect model, for which the variance components are estimated with mixed model equations. In recent years, deep learning (DL) methods have been considered in the context of genomic prediction. DL methods are nonparametric models that offer the flexibility to adapt to complicated associations between data and output, including very complex patterns. Main body: We review the applications of DL methods in genomic selection (GS) to obtain a meta-picture of GS performance and highlight how these tools can help solve challenging plant breeding problems. We also provide general guidance for the effective use of DL methods, including the fundamentals of DL and the requirements for its appropriate use. We discuss the pros and cons of this technique compared to traditional genomic prediction approaches, as well as the current trends in DL applications. Conclusions: The main requirement for using DL is high-quality and sufficiently large training data. Based on the current literature on GS in plant and animal breeding, we did not find clear superiority of DL in prediction power over conventional genome-based prediction models. Nevertheless, there is clear evidence that DL algorithms capture nonlinear patterns more efficiently than conventional genome-based models. DL algorithms can integrate data from different sources, as is usually needed in GS-assisted breeding, and they show the ability to improve prediction accuracy for large plant breeding data sets. It is important to apply DL to large training-testing data sets.

Fine mapping of QTL and genomic prediction using allele-specific expression SNPs demonstrates that the complex trait of genetic resistance to Marek’s disease is predominantly determined by transcriptional regulation

BMC Genomics ◽

10.1186/s12864-015-2016-0 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 14

Author(s):

Hans H. Cheng ◽

Sudeep Perumbakkam ◽

Alexis Black Pyrkosz ◽

John R. Dunn ◽

Andres Legarra ◽

...

Keyword(s):

Transcriptional Regulation ◽

Fine Mapping ◽

Genomic Prediction ◽

Genetic Resistance ◽

Complex Trait ◽

Marek's Disease ◽

Specific Expression ◽

Allele Specific Expression ◽

Allele Specific

Perspectives on Applications of Hierarchical Gene-To-Phenotype (G2P) Maps to Capture Non-stationary Effects of Alleles in Genomic Prediction

Frontiers in Plant Science ◽

10.3389/fpls.2021.663565 ◽

2021 ◽

Vol 12 ◽

Author(s):

Owen M. Powell ◽

Kai P. Voss-Fels ◽

David R. Jordan ◽

Graeme Hammer ◽

Mark Cooper

Keyword(s):

Plant Breeding ◽

Genomic Prediction ◽

Complex Traits ◽

Prediction Accuracy ◽

Predictive Ability ◽

Complex Trait ◽

Substitution Effects ◽

Term Prediction ◽

GxE Interactions

Genomic prediction of complex traits across environments, breeding cycles, and populations remains a challenge for plant breeding. A potential explanation for this is that underlying non-additive genetic (GxG) and genotype-by-environment (GxE) interactions generate allele substitution effects that are non-stationary across different contexts. Such non-stationary effects of alleles are either ignored or assumed to be implicitly captured by most gene-to-phenotype (G2P) maps used in genomic prediction. The implicit capture of non-stationary effects of alleles requires the G2P map to be re-estimated across different contexts. We discuss the development and application of hierarchical G2P maps that explicitly capture non-stationary effects of alleles and have successfully increased short-term prediction accuracy in plant breeding. These hierarchical G2P maps achieve increases in prediction accuracy by allowing intermediate processes such as other traits and environmental factors and their interactions to contribute to complex trait variation. However, long-term prediction remains a challenge. The plant breeding community should undertake complementary simulation and empirical experiments to interrogate various hierarchical G2P maps that connect GxG and GxE interactions simultaneously. The existing genetic correlation framework can be used to assess the magnitude of non-stationary effects of alleles and the predictive ability of these hierarchical G2P maps in long-term, multi-context genomic predictions of complex traits in plant breeding.

Haploid maize seeds prediction using deep learning and using mock reference genomes for genomic prediction of hybrids

10.11606/t.11.2020.tde-12022021-155733 ◽

2020 ◽

Author(s):

José Felipe Gonzaga Sabadin

Keyword(s):

Deep Learning ◽

Genomic Prediction ◽

Maize Seeds ◽

Reference Genomes

Genomic Prediction of Resistance to Tar Spot Complex of Maize in Multiple Populations Using Genotyping-by-Sequencing SNPs

Frontiers in Plant Science ◽

10.3389/fpls.2021.672525 ◽

2021 ◽

Vol 12 ◽

Author(s):

Shiliang Cao ◽

Junqiao Song ◽

Yibing Yuan ◽

Ao Zhang ◽

Jiaojiao Ren ◽

...

Keyword(s):

Genetic Diversity ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Genetic Relationships ◽

Genotyping By Sequencing ◽

Genotypic Diversity ◽

Complex Trait ◽

Training Set ◽

Tar Spot ◽

Training Sets

Tar spot complex (TSC) is one of the most important foliar diseases in tropical maize. TSC resistance could be further improved by implementing marker-assisted selection (MAS) and genomic selection (GS) individually, or by implementing them stepwise. Implementation of GS requires a profound understanding of the factors affecting genomic prediction accuracy. In the present study, an association-mapping panel and three doubled haploid populations, genotyped with genotyping-by-sequencing, were used to estimate the effectiveness of GS for improving TSC resistance. When the training and prediction sets were independent, moderate-to-high prediction accuracies were achieved across populations by using training sets with broader genetic diversity, or in pairwise populations having closer genetic relationships. A collection of inbred lines with broader genetic diversity could be used as a permanent training set for TSC improvement, which can be updated by adding more phenotyped lines having closer genetic relationships with the prediction set. The prediction accuracies estimated with a few significantly associated SNPs were moderate-to-high, and increased continuously as more significantly associated SNPs were included. This confirmed that TSC resistance could be further improved by implementing GS for selecting multiple stable genomic regions simultaneously, or by implementing MAS and GS stepwise. Marker density, marker quality, and the heterozygosity rate of samples had minor effects on the estimated genomic prediction accuracy. The training set size, the genetic relationship between training and prediction sets, the phenotypic and genotypic diversity of the training sets, and the incorporation of known trait-marker associations played more important roles in improving prediction accuracy. The results of the present study provide insight into the improvement of less complex traits via GS in maize.

Multi-Trait, Multi-Environment Genomic Prediction of Durum Wheat With Genomic Best Linear Unbiased Predictor and Deep Learning Methods

Frontiers in Plant Science ◽

10.3389/fpls.2019.01311 ◽

2019 ◽

Vol 10 ◽

Cited By ~ 2

Author(s):

Osval A. Montesinos-López ◽

Abelardo Montesinos-López ◽

Roberto Tuberosa ◽

Marco Maccaferri ◽

Giuseppe Sciara ◽

...

Keyword(s):

Deep Learning ◽

Durum Wheat ◽

Genomic Prediction ◽

Best Linear Unbiased Predictor ◽

Learning Methods ◽

Best Linear Unbiased

Multi-Trait Genomic Prediction of Yield-Related Traits in US Soft Wheat under Variable Water Regimes

Genes ◽

10.3390/genes11111270 ◽

2020 ◽

Vol 11 (11) ◽

pp. 1270 ◽

Cited By ~ 1

Author(s):

Jia Guo ◽

Jahangir Khan ◽

Sumit Pradhan ◽

Dipendra Shahi ◽

Naeem Khan ◽

...

Keyword(s):

Deep Learning ◽

Genomic Prediction ◽

Genotyping By Sequencing ◽

Triticum aestivum L. ◽

Yield Component ◽

Soft Wheat ◽

Response To Selection ◽

Best Linear Unbiased ◽

Trait Model ◽

Variable Water

The performance of genomic prediction (GP) on genetically correlated traits can be improved through an interdependence multi-trait model under a multi-environment context. In this study, a panel of 237 soft facultative wheat (Triticum aestivum L.) lines was evaluated to compare single- and multi-trait models for predicting grain yield (GY), harvest index (HI), spike fertility (SF), and thousand grain weight (TGW). The panel was phenotyped in two locations and two years in Florida under drought and moderate drought stress conditions, while the genotyping was performed using 27,957 genotyping-by-sequencing (GBS) single nucleotide polymorphism (SNP) markers. Five predictive models were compared: Multi-environment Genomic Best Linear Unbiased Predictor (MGBLUP), Bayesian Multi-trait Multi-environment (BMTME), Bayesian Multi-output Regressor Stacking (BMORS), Single-trait Multi-environment Deep Learning (SMDL), and Multi-trait Multi-environment Deep Learning (MMDL). Across environments, the multi-trait statistical model (BMTME) was superior to the multi-trait DL model for prediction accuracy in most scenarios, but the DL models were comparable to the statistical models for response to selection. The multi-trait model also showed 5 to 22% more genetic gain than the single-trait model across environments, as reflected by the response to selection. Overall, these results suggest that multi-trait genomic prediction can be an efficient strategy for economically important yield-component-related traits in soft wheat.

98 Using differential evolution to improve predictive accuracy of deep learning models applied to pig production data

Journal of Animal Science ◽

10.1093/jas/skaa054.048 ◽

2020 ◽

Vol 98 (Supplement_3) ◽

pp. 27-27

Author(s):

Junjie Han ◽

Cedric Gondro ◽

Juan Steibel

Keyword(s):

Deep Learning ◽

Differential Evolution ◽

Image Classification ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Predictive Accuracy ◽

Traditional Approach ◽

Exhaustive Search ◽

Pig Production ◽

Hyperparameter Selection

Deep learning (DL) is being used for prediction in precision livestock farming and in genomic prediction. However, optimizing hyperparameters in DL models is critical for their predictive performance. Grid search is the traditional approach to selecting hyperparameters in DL, but it requires an exhaustive search over the parameter space. We propose hyperparameter selection using differential evolution (DE), a heuristic algorithm that does not require exhaustive search. The goal of this study was to design and apply DE to optimize hyperparameters of DL models for genomic prediction and image analysis in pig production systems. One dataset consisted of 910 pigs genotyped with 28,916 SNP markers to predict their post-mortem meat pH. Another dataset consisted of 1,334 images of pigs eating inside a single-spaced feeder, classified as "single pig" or "multiple pigs." The accuracy of genomic prediction was defined as the correlation between the predicted pH and the observed pH. The image classification accuracy was the proportion of correctly classified images. For genomic prediction, a multilayer perceptron (MLP) was optimized. For image classification, MLP and convolutional neural networks (CNN) were optimized. For genomic prediction, the initial hyperparameter set resulted in an accuracy of 0.032, and for image classification the initial accuracy was between 0.72 and 0.76. After optimization using DE, the genomic prediction accuracy was 0.3688, compared to 0.334 using GBLUP. The top selected models included one layer, 60 neurons, sigmoid activation, and L2 penalty = 0.3. The accuracy of image classification after optimization was between 0.89 and 0.92. Selected models included three layers, adamax optimizer, and relu or elu activation for the MLP, and one layer, 64 filters, and 5×5 filter size for the CNN. DE can adapt the hyperparameter selection to each problem, dataset, and model, and it significantly increased prediction accuracy with minimal user input.
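The idea of searching hyperparameters with differential evolution rather than an exhaustive grid can be sketched as follows. This is a minimal, generic DE/rand/1/bin loop, not the authors' implementation; the abstract does not state which DE variant or settings were used. The objective `toy_validation_error` is an invented smooth surface standing in for "train a network with these hyperparameters and return its validation error", and the search dimensions (log10 of the L2 penalty, number of neurons) and all numeric settings are assumptions made for illustration.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=150, seed=0):
    """Minimize `objective` over box `bounds` with a basic DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            # Greedy selection: keep the trial only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_validation_error(params):
    """Invented stand-in for a validation error; minimum at log10(L2) = -2, 32 neurons."""
    log_l2, neurons = params
    return (log_l2 + 2.0) ** 2 + ((neurons - 32.0) / 50.0) ** 2


# Hypothetical search space: log10(L2 penalty) in [-4, 1], neurons in [1, 200].
bounds = [(-4.0, 1.0), (1.0, 200.0)]
best_params, best_error = differential_evolution(toy_validation_error, bounds)
```

With a real network, each call to the objective would involve a full training run, so the population size and number of generations set the computational budget. Integer or categorical hyperparameters (number of layers, activation function) would need to be rounded or mapped from the continuous search values.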

Using local convolutional neural networks for genomic prediction

10.1101/2020.05.12.090118 ◽

2020 ◽

Author(s):

Torsten Pook ◽

Jan Freudenthal ◽

Arthur Korte ◽

Henner Simianer

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Convolutional Neural Network ◽

Convolutional Neural Networks ◽

Genomic Prediction ◽

Linear Models ◽

Predictive Ability ◽

High Heritability ◽

Fully Connected

The prediction of breeding values and phenotypes is of central importance for both livestock and crop breeding. With increasing computational power and ever more data available, machine learning, and especially deep learning, has risen in popularity over the last few years. In this study, we propose the use of local convolutional neural networks for genomic prediction, as a region-specific filter corresponds much better to our prior genetic knowledge of traits than traditional convolutional neural networks do. Model performance is evaluated on a simulated maize data panel (n = 10,000) and real Arabidopsis data (n = 2,039) for a variety of traits, with the local convolutional neural network outperforming both multilayer perceptrons and convolutional neural networks for essentially all considered traits. Linear models such as genomic best linear unbiased prediction, which are often used for genomic prediction, are outperformed by up to 24%. The highest gains in predictive ability were obtained in cases of medium trait complexity with high heritability and large training populations. However, for small datasets with 100 or 250 individuals available for training, the local convolutional neural network performs slightly worse than the linear models. Nonetheless, this is still 15% better than a traditional convolutional neural network, indicating better performance and robustness of our proposed model architecture for small training populations. In addition to the baseline model, various other architectures with different window sizes and strides in the local convolutional layer, as well as different numbers of nodes in the subsequent fully connected layers, are compared against each other. Finally, the practical usefulness of deep learning, and in particular of local convolutional neural networks, is critically discussed with regard to multidimensional inputs and outputs, computing times, and other potential hazards.
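The difference between a traditional and a local convolutional layer can be made concrete with a small sketch. In a traditional 1D convolution the same filter is slid across all marker windows, whereas in a local (locally connected) layer each window has its own weights, so a filter can specialize to one genomic region. The code below is an illustration under stated assumptions, not the authors' architecture: the function names, the ReLU activation, the 0/1/2 marker coding, and all weights are hypothetical.

```python
def local_conv1d(markers, window_weights, window_biases, window_size, stride):
    """Locally connected 1D layer: window k uses its own weights and bias."""
    n_windows = (len(markers) - window_size) // stride + 1
    if len(window_weights) != n_windows or len(window_biases) != n_windows:
        raise ValueError("need exactly one weight vector and bias per window")
    outputs = []
    for k in range(n_windows):
        start = k * stride
        window = markers[start:start + window_size]
        z = sum(w * x for w, x in zip(window_weights[k], window)) + window_biases[k]
        outputs.append(max(z, 0.0))  # ReLU activation (an assumption)
    return outputs


def shared_conv1d(markers, kernel, bias, window_size, stride):
    """Traditional 1D convolution: the same kernel is reused for every window."""
    n_windows = (len(markers) - window_size) // stride + 1
    return local_conv1d(markers, [kernel] * n_windows, [bias] * n_windows,
                        window_size, stride)


# Hypothetical example: six markers coded 0/1/2, windows of two markers, stride two.
markers = [0, 1, 2, 1, 0, 2]
region_weights = [[1.0, 0.5], [-1.0, 1.0], [0.25, 0.25]]
region_biases = [0.0, 0.0, 0.5]

local_out = local_conv1d(markers, region_weights, region_biases, 2, 2)
shared_out = shared_conv1d(markers, [1.0, 0.5], 0.0, 2, 2)
# local_out  -> [0.5, 0.0, 1.0]
# shared_out -> [0.5, 2.5, 1.0]
```

The local layer has one weight vector per window, so its parameter count grows with the number of windows. The window size and stride used here correspond to the architecture choices the abstract says were compared.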
