Haplotype associated RNA expression (HARE) improves prediction of complex traits in maize

Genomic prediction typically relies on associations between single-site polymorphisms and traits of interest. This representation of genomic variability has been successful for predicting many complex traits. However, it usually cannot capture the combination of alleles in haplotypes and it has generated little insight about the biological function of polymorphisms. Here we present a novel and cost-effective method for imputing cis haplotype associated RNA expression (HARE), studied their transferability across tissues, and evaluated genomic prediction models within and across populations. HARE focuses on tightly linked cis acting causal variants in the immediate vicinity of the gene, while excluding trans effects from diffusion and metabolism. Therefore, HARE estimates were more transferrable across different tissues and populations compared to measured transcript expression. We also showed that HARE estimates captured one-third of the variation in gene expression. HARE estimates were used in genomic prediction models evaluated within and across two diverse maize panels–a diverse association panel (Goodman Association panel) and a large half-sib panel (Nested Association Mapping panel)–for predicting 26 complex traits. HARE resulted in up to 15% higher prediction accuracy than control approaches that preserved haplotype structure, suggesting that HARE carried functional information in addition to information about haplotype structure. The largest increase was observed when the model was trained in the Nested Association Mapping panel and tested in the Goodman Association panel. Additionally, HARE yielded higher within-population prediction accuracy as compared to measured expression values. The accuracy achieved by measured expression was variable across tissues, whereas accuracy by HARE was more stable across tissues. Therefore, imputing RNA expression of genes by haplotype is stable, cost-effective, and transferable across populations.

Download Full-text

Haplotype Associated RNA Expression (HARE) Improves Prediction of Complex Traits in Maize

10.1101/2021.04.30.442099 ◽

2021 ◽

Author(s):

Anju Giri ◽

Merritt Khaipho-Burch ◽

Edward S. Buckler ◽

Guillaume P. Ramstein

Keyword(s):

Genomic Prediction ◽

Complex Traits ◽

Gene Sequence ◽

Prediction Models ◽

Cost Effective ◽

Rna Expression ◽

Association Panel ◽

Nested Association Mapping ◽

Mapping Panel ◽

Expression Of Genes

AbstractGenomic prediction typically relies on associations between single-site polymorphisms and traits of interest. This representation of genomic variability has been successful for prediction within populations. However, it usually cannot capture the complex effects due to combination of alleles in haplotypes. Therefore, accuracy across populations has usually been low. Here we present a novel and cost-effective method for imputing cis haplotype associated RNA expression (HARE, RNA expression of genes by haplotype), studied their transferability across tissues, and evaluated genomic prediction models within and across populations. HARE focuses on tightly linked cis acting causal variants in the immediate vicinity of the gene, while excluding trans effects from diffusion and metabolism, so it would be more transferrable across different tissues and populations. We showed that HARE estimates captured one-third of the variation in gene expression and were more transferable across diverse tissues than the measured transcript expression. HARE estimates were used in genomic prediction models evaluated within and across two diverse maize panels – a diverse association panel (Goodman Association panel) and a large half-sib panel (Nested Association Mapping panel) – for predicting 26 complex traits. HARE resulted in up to 15% higher prediction accuracy than control approaches that preserved haplotype structure, suggesting that HARE carried functional information in addition to information about haplotype structure. The largest increase was observed when the model was trained in the Nested Association Mapping panel and tested in the Goodman Association panel. Additionally, HARE yielded higher within-population prediction accuracy as compared to measured expression values. The accuracy achieved by measured expression was variable across tissues whereas accuracy using HARE was more stable across tissues. Therefore, imputing RNA expression of genes by haplotype is stable, cost-effective, and transferable across populations.Author summaryThe increasing availability of genomic data has been widely used in the prediction of many traits. However, genome wide prediction has been mostly carried out within populations and without explicit modeling of RNA or protein expression. In this study, we explored the prediction of field traits within and across populations using estimated RNA expression attributable to only the DNA sequence around a gene. We showed that the estimated RNA expression was more transferable than overall measured RNA expression. We improved prediction of field traits up to 15% using estimated gene expression as compared to observed expression or gene sequence alone. Overall, these findings indicate that structural and functional information in the gene sequence are highly transferable.

Download Full-text

Pitfalls and Remedies for Cross Validation with Multi-trait Genomic Prediction Methods

10.1101/595397 ◽

2019 ◽

Author(s):

Daniel Runcie ◽

Hao Cheng

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Cross Validation ◽

Prediction Models ◽

Selection Index ◽

Parametric Method ◽

Multiple Traits ◽

Gold Standard Method ◽

Secondary Traits ◽

Validation Strategy

ABSTRACTIncorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting which may result in a loss of prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model, and can effectively choose among disparate models based on their expected performance in real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and lead to sub-optimal choices between single and multi-trait models when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: 1) a parametric solution from selection index theory, 2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy, and 3) a fully non-parametric method which we call CV2*: validating model predictions against focal trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine if and when combining information across multiple traits is useful.

Download Full-text

Natural variation and genomic prediction of growth, physiological traits, and nitrogen-use efficiency in perennial ryegrass under low-nitrogen stress

Journal of Experimental Botany ◽

10.1093/jxb/eraa388 ◽

2020 ◽

Vol 71 (20) ◽

pp. 6670-6683

Author(s):

Xiongwei Zhao ◽

Gang Nie ◽

Yanyu Yao ◽

Zhongjie Ji ◽

Jianhua Gao ◽

...

Keyword(s):

Perennial Ryegrass ◽

Nitrogen Use Efficiency ◽

Ridge Regression ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Prediction Models ◽

Physiological Traits ◽

Grass Species ◽

Nitrogen Use ◽

Use Efficiency

Abstract Genomic prediction of nitrogen-use efficiency (NUE) has not previously been studied in perennial grass species exposed to low-N stress. Here, we conducted a genomic prediction of physiological traits and NUE in 184 global accessions of perennial ryegrass (Lolium perenne) in response to a normal (7.5 mM) and low (0.75 mM) supply of N. After 21 d of treatment under greenhouse conditions, significant variations in plant height increment (ΔHT), leaf fresh weight (LFW), leaf dry weight (LDW), chlorophyll index (Chl), chlorophyll fluorescence, leaf N and carbon (C) contents, C/N ratio, and NUE were observed in accessions , but to a greater extent under low-N stress. Six genomic prediction models were applied to the data, namely the Bayesian method Bayes C, Bayesian LASSO, Bayesian Ridge Regression, Ridge Regression-Best Linear Unbiased Prediction, Reproducing Kernel Hilbert Spaces, and randomForest. These models produced similar prediction accuracy of traits within the normal or low-N treatments, but the accuracy differed between the two treatments. ΔHT, LFW, LDW, and C were predicted slightly better under normal N with a mean Pearson r-value of 0.26, compared with r=0.22 under low N, while the prediction accuracies for Chl, N, C/N, and NUE were significantly improved under low-N stress with a mean r=0.45, compared with r=0.26 under normal N. The population panel contained three population structures, which generally had no effect on prediction accuracy. The moderate prediction accuracies obtained for N, C, and NUE under low-N stress are promising, and suggest a feasible means by which germplasm might be initially assessed for further detailed studies in breeding programs.

Download Full-text

Genomic Prediction with Genotype by Environment Interaction Analysis for Kernel Zinc Concentration in Tropical Maize Germplasm

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401172 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2629-2639

Author(s):

Edna K. Mageto ◽

Jose Crossa ◽

Paulino Pérez-Rodríguez ◽

Thanda Dhliwayo ◽

Natalia Palacios-Rojas ◽

...

Keyword(s):

Genomic Prediction ◽

Prediction Models ◽

Interaction Analysis ◽

Genotype By Environment Interaction ◽

Zn Deficiency ◽

Environment Interaction ◽

Prediction Ability ◽

Association Panel ◽

Maize Germplasm ◽

Zn Concentration

Zinc (Zn) deficiency is a major risk factor for human health, affecting about 30% of the world’s population. To study the potential of genomic selection (GS) for maize with increased Zn concentration, an association panel and two doubled haploid (DH) populations were evaluated in three environments. Three genomic prediction models, M (M1: Environment + Line, M2: Environment + Line + Genomic, and M3: Environment + Line + Genomic + Genomic x Environment) incorporating main effects (lines and genomic) and the interaction between genomic and environment (G x E) were assessed to estimate the prediction ability (rMP) for each model. Two distinct cross-validation (CV) schemes simulating two genomic prediction breeding scenarios were used. CV1 predicts the performance of newly developed lines, whereas CV2 predicts the performance of lines tested in sparse multi-location trials. Predictions for Zn in CV1 ranged from -0.01 to 0.56 for DH1, 0.04 to 0.50 for DH2 and -0.001 to 0.47 for the association panel. For CV2, rMP values ranged from 0.67 to 0.71 for DH1, 0.40 to 0.56 for DH2 and 0.64 to 0.72 for the association panel. The genomic prediction model which included G x E had the highest average rMP for both CV1 (0.39 and 0.44) and CV2 (0.71 and 0.51) for the association panel and DH2 population, respectively. These results suggest that GS has potential to accelerate breeding for enhanced kernel Zn concentration by facilitating selection of superior genotypes.

Download Full-text

Accounting for Group-Specific Allele Effects and Admixture in Genomic Predictions: Theory and Experimental Evaluation in Maize

Genetics ◽

10.1534/genetics.120.303278 ◽

2020 ◽

Vol 216 (1) ◽

pp. 27-41

Author(s):

Simon Rio ◽

Laurence Moreau ◽

Alain Charcosset ◽

Tristan Mary-Huard

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Prediction Models ◽

Best Linear Unbiased Prediction ◽

Linear Unbiased Prediction ◽

Modeling Group ◽

A Genome ◽

Specific Allele ◽

Best Linear Unbiased ◽

Unbiased Prediction

Populations structured into genetic groups may display group-specific linkage disequilibrium, mutations, and/or interactions between quantitative trait loci and the genetic background. These factors lead to heterogeneous marker effects affecting the efficiency of genomic prediction, especially for admixed individuals. Such individuals have a genome that is a mosaic of chromosome blocks from different origins, and may be of interest to combine favorable group-specific characteristics. We developed two genomic prediction models adapted to the prediction of admixed individuals in presence of heterogeneous marker effects: multigroup admixed genomic best linear unbiased prediction random individual (MAGBLUP-RI), modeling the ancestry of alleles; and multigroup admixed genomic best linear unbiased prediction random allele effect (MAGBLUP-RAE), modeling group-specific distributions of allele effects. MAGBLUP-RI can estimate the segregation variance generated by admixture while MAGBLUP-RAE can disentangle the variability that is due to main allele effects from the variability that is due to group-specific deviation allele effects. Both models were evaluated for their genomic prediction accuracy using a maize panel including lines from the Dent and Flint groups, along with admixed individuals. Based on simulated traits, both models proved their efficiency to improve genomic prediction accuracy compared to standard GBLUP models. For real traits, a clear gain was observed at low marker densities whereas it became limited at high marker densities. The interest of including admixed individuals in multigroup training sets was confirmed using simulated traits, but was variable using real traits. Both MAGBLUP models and admixed individuals are of interest whenever group-specific SNP allele effects exist.

Download Full-text

Haplotype genomic prediction of phenotypic values based on chromosome distance and gene boundaries using low-coverage sequencing in Duroc pigs

Genetics Selection Evolution ◽

10.1186/s12711-021-00661-y ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Cheng Bian ◽

Dzianis Prakapenka ◽

Cheng Tan ◽

Ruifei Yang ◽

Di Zhu ◽

...

Keyword(s):

Genomic Selection ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Prediction Models ◽

Average Daily Gain ◽

Live Weight ◽

Feed Conversion ◽

Muscle Area ◽

Haplotype Blocks ◽

Low Coverage

Abstract Background Genomic selection using single nucleotide polymorphism (SNP) markers has been widely used for genetic improvement of livestock, but most current methods of genomic selection are based on SNP models. In this study, we investigated the prediction accuracies of haplotype models based on fixed chromosome distances and gene boundaries compared to those of SNP models for genomic prediction of phenotypic values. We also examined the reasons for the successes and failures of haplotype genomic prediction. Methods We analyzed a swine population of 3195 Duroc boars with records on eight traits: body judging score (BJS), teat number (TN), age (AGW), loin muscle area (LMA), loin muscle depth (LMD) and back fat thickness (BF) at 100 kg live weight, and average daily gain (ADG) and feed conversion rate (FCR) from 30 to100 kg live weight. Ten-fold validation was used to evaluate the prediction accuracy of each SNP model and each multi-allelic haplotype model based on 488,124 autosomal SNPs from low-coverage sequencing. Haplotype blocks were defined using fixed chromosome distances or gene boundaries. Results Compared to the best SNP model, the accuracy of predicting phenotypic values using a haplotype model was greater by 7.4% for BJS, 7.1% for AGW, 6.6% for ADG, 4.9% for FCR, 2.7% for LMA, 1.9% for LMD, 1.4% for BF, and 0.3% for TN. The use of gene-based haplotype blocks resulted in the best prediction accuracy for LMA, LMD, and TN. Compared to estimates of SNP additive heritability, estimates of haplotype epistasis heritability were strongly correlated with the increase in prediction accuracy by haplotype models. The increase in prediction accuracy was largest for BJS, AGW, ADG, and FCR, which also had the largest estimates of haplotype epistasis heritability, 24.4% for BJS, 14.3% for AGW, 14.5% for ADG, and 17.7% for FCR. SNP and haplotype heritability profiles across the genome identified several genes with large genetic contributions to phenotypes: NUDT3 for LMA, LMD and BF, VRTN for TN, COL5A2 for BJS, BSND for ADG, and CARTPT for FCR. Conclusions Haplotype prediction models improved the accuracy for genomic prediction of phenotypes in Duroc pigs. For some traits, the best prediction accuracy was obtained with haplotypes defined using gene regions, which provides evidence that functional genomic information can improve the accuracy of haplotype genomic prediction for certain traits.

Download Full-text

Genetic architecture and genomic prediction accuracy of apple quantitative traits across environments

10.1101/2021.11.29.470309 ◽

2021 ◽

Author(s):

Michaela Jung ◽

Beat Keller ◽

Morgane Roth ◽

Maria Jose Aranzana ◽

Annemarie Auwerkerken ◽

...

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Genetic Architecture ◽

Quantitative Traits ◽

Prediction Models ◽

Phenotypic Variability ◽

Reference Population ◽

Genomic Study ◽

Genomic Tools ◽

Breeding Efficiency

Implementation of genomic tools is desirable to increase the efficiency of apple breeding. The apple reference population (apple REFPOP) proved useful for rediscovering loci, estimating genomic prediction accuracy, and studying genotype by environment interactions (GxE). Here we show contrasting genetic architecture and genomic prediction accuracies for 30 quantitative traits across up to six European locations using the apple REFPOP. A total of 59 stable and 277 location-specific associations were found using GWAS, 69.2% of which are novel when compared with 41 reviewed publications. Average genomic prediction accuracies of 0.18-0.88 were estimated using single-environment univariate, single-environment multivariate, multi-environment univariate, and multi-environment multivariate models. The GxE accounted for up to 24% of the phenotypic variability. This most comprehensive genomic study in apple in terms of trait-environment combinations provided knowledge of trait biology and prediction models that can be readily applied for marker-assisted or genomic selection, thus facilitating increased breeding efficiency.

Download Full-text

Accuracy of gene expression prediction from genotype data with PrediXcan varies across diverse populations

10.1101/524728 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anna Mikhaylova ◽

Timothy Thornton

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Prediction Accuracy ◽

Prediction Models ◽

Predictive Performance ◽

Training Data ◽

Supervised Machine Learning ◽

Whole Genome Sequencing Data ◽

P Value ◽

Diverse Populations

AbstractPredicting gene expression with genetic data has garnered significant attention in recent years. PrediXcan is one of the most widely used gene-based association methods for testing imputed gene expression values with a phenotype due to the invaluable insight the method has shown into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The prediction models for PrediXcan, however, were obtained using supervised machine learning methods and training data from the Depression and Gene Network (DGN) and the Genotype-Tissue Expression (GTEx) data, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we assess the accuracy of gene expression predictions with PrediXcan in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Health and Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European populations. Prediction results are obtained using a range of models from PrediXcan weight databases, and Pearson’s correlation coefficient is used to measure prediction accuracy. We demonstrate that the predictive performance of PrediXcan varies across populations (F-test p-value < 0.001), where prediction accuracy is the worst in the Yoruban sample compared to European samples. Moreover, the performance of PrediXcan varies not only among distant populations, but also among closely related populations as well. We also find that the qualitative performance of PrediXcan for the populations considered is consistent across all weight databases used.

Download Full-text

Harnessing Genetic Diversity in the USDA Pea Germplasm Collection Through Genomic Prediction

Frontiers in Genetics ◽

10.3389/fgene.2021.707754 ◽

2021 ◽

Vol 12 ◽

Author(s):

Md. Abdullah Al Bari ◽

Ping Zheng ◽

Indalecio Viera ◽

Hannah Worral ◽

Stephen Szwiec ◽

...

Keyword(s):

Genetic Diversity ◽

Seed Yield ◽

Genomic Prediction ◽

Complex Traits ◽

Prediction Models ◽

Germplasm Collection ◽

Predictive Ability ◽

Snp Markers ◽

Breeding Values ◽

Germplasm Collections

Phenotypic evaluation and efficient utilization of germplasm collections can be time-intensive, laborious, and expensive. However, with the plummeting costs of next-generation sequencing and the addition of genomic selection to the plant breeder’s toolbox, we now can more efficiently tap the genetic diversity within large germplasm collections. In this study, we applied and evaluated genomic prediction’s potential to a set of 482 pea (Pisum sativum L.) accessions—genotyped with 30,600 single nucleotide polymorphic (SNP) markers and phenotyped for seed yield and yield-related components—for enhancing selection of accessions from the USDA Pea Germplasm Collection. Genomic prediction models and several factors affecting predictive ability were evaluated in a series of cross-validation schemes across complex traits. Different genomic prediction models gave similar results, with predictive ability across traits ranging from 0.23 to 0.60, with no model working best across all traits. Increasing the training population size improved the predictive ability of most traits, including seed yield. Predictive abilities increased and reached a plateau with increasing number of markers presumably due to extensive linkage disequilibrium in the pea genome. Accounting for population structure effects did not significantly boost predictive ability, but we observed a slight improvement in seed yield. By applying the best genomic prediction model (e.g., RR-BLUP), we then examined the distribution of genotyped but nonphenotyped accessions and the reliability of genomic estimated breeding values (GEBV). The distribution of GEBV suggested that none of the nonphenotyped accessions were expected to perform outside the range of the phenotyped accessions. Desirable breeding values with higher reliability can be used to identify and screen favorable germplasm accessions. Expanding the training set and incorporating additional orthogonal information (e.g., transcriptomics, metabolomics, physiological traits, etc.) into the genomic prediction framework can enhance prediction accuracy.

Download Full-text

Optimizing Low-Cost Genotyping and Imputation Strategies for Genomic Selection in Atlantic Salmon

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400800 ◽

2019 ◽

Vol 10 (2) ◽

pp. 581-590 ◽

Cited By ~ 4

Author(s):

Smaragda Tsairidou ◽

Alastair Hamilton ◽

Diego Robledo ◽

James E. Bron ◽

Ross D. Houston

Keyword(s):

Atlantic Salmon ◽

Genomic Selection ◽

Environmental Sustainability ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Imputation Accuracy ◽

Cost Effective ◽

High Density ◽

Genotype Imputation ◽

Breeding Programs

Genomic selection enables cumulative genetic gains in key production traits such as disease resistance, playing an important role in the economic and environmental sustainability of aquaculture production. However, it requires genome-wide genetic marker data on large populations, which can be prohibitively expensive. Genotype imputation is a cost-effective method for obtaining high-density genotypes, but its value in aquaculture breeding programs which are characterized by large full-sibling families has yet to be fully assessed. The aim of this study was to optimize the use of low-density genotypes and evaluate genotype imputation strategies for cost-effective genomic prediction. Phenotypes and genotypes (78,362 SNPs) were obtained for 610 individuals from a Scottish Atlantic salmon breeding program population (Landcatch, UK) challenged with sea lice, Lepeophtheirus salmonis. The genomic prediction accuracy of genomic selection was calculated using GBLUP approaches and compared across SNP panels of varying densities and composition, with and without imputation. Imputation was tested when parents were genotyped for the optimal SNP panel, and offspring were genotyped for a range of lower density imputation panels. Reducing SNP density had little impact on prediction accuracy until 5,000 SNPs, below which the accuracy dropped. Imputation accuracy increased with increasing imputation panel density. Genomic prediction accuracy when offspring were genotyped for just 200 SNPs, and parents for 5,000 SNPs, was 0.53. This accuracy was similar to the full high density and optimal density dataset, and markedly higher than using 200 SNPs without imputation. These results suggest that imputation from very low to medium density can be a cost-effective tool for genomic selection in Atlantic salmon breeding programs.

Download Full-text