Importance of correcting genomic relationships in single-locus QTL mapping model with an Advanced Backcross population

Abstract Advanced Backcross (AB) populations have been widely used to identify and utilize beneficial alleles in various crops such as rice, tomato, wheat and barley. Developing an AB population involves a controlled crossing scheme, and this controlled crossing, together with the selection (both natural and artificial) of agronomically adapted alleles during population development, may lead to unbalanced allele frequencies in the population. However, it is commonly believed that interval mapping of traits in experimental crosses such as AB populations is immune to deviations from the allele frequencies expected under Mendelian segregation. Using two AB populations and simulated data sets as examples, we describe the severity of the problem caused by unbalanced allele frequencies in quantitative trait loci (QTL) mapping and demonstrate how it can be corrected using a linear mixed model with a polygenic effect whose covariance structure (the genomic relationship matrix) is calculated from molecular markers.
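
For orientation, a single-locus mixed model of the kind the abstract describes is commonly written as follows. This is a generic sketch, not necessarily the authors' exact parameterization; all symbols (y, X, β, m_k, a_k, g, G, e) are notation assumed here for illustration.

```latex
% y: phenotype vector; X\beta: fixed effects (e.g. the intercept);
% m_k: genotype codes at the tested locus k; a_k: its additive effect;
% g: polygenic effect with covariance proportional to the genomic
%    relationship matrix G computed from markers; e: residual.
\begin{aligned}
  y &= X\beta + m_k a_k + g + e, \\
  g &\sim N\!\left(0,\; G\,\sigma_g^{2}\right), \qquad
  e \sim N\!\left(0,\; I\,\sigma_e^{2}\right).
\end{aligned}
```

Dropping the term g recovers an ordinary single-locus regression, which is the model that has no correction for genomic relationships.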

Preprocessing Tools for Data Preparation

Multivariate Statistical Machine Learning Methods for Genomic Prediction ◽

10.1007/978-3-030-89010-0_2 ◽

2022 ◽

pp. 35-70

Author(s):

Osval Antonio Montesinos López ◽

Abelardo Montesinos López ◽

Jose Crossa

Keyword(s):

Machine Learning ◽

Mixed Model ◽

Linear Mixed Model ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Data Preparation ◽

Statistical Machine Learning ◽

Major Allele Frequency ◽

Major Allele ◽

Best Linear Unbiased

Abstract This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model and explain how to decide between fixed effects, which give rise to best linear unbiased estimates (BLUE or BLUEs), and random effects, which give rise to best linear unbiased predictors (BLUP or BLUPs). R code for fitting linear mixed models is given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, marker recodification, computing the frequency of heterozygotes and the frequency of NAs, and three methods for computing the genomic relationship matrix. In addition, scaling and data compression of inputs are important in statistical machine learning. For a more extensive description of linear mixed models, see Chap. 10.1007/978-3-030-89010-0_5.
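
To make the allele-frequency and relationship-matrix steps concrete, here is a minimal pure-Python sketch. It is not the chapter's R code. It assumes markers coded 0/1/2 (copies of one allele) with `None` for missing calls, and it uses the widely cited VanRaden (2008) first method, which may or may not be one of the chapter's three methods. The function names are hypothetical.

```python
def allele_frequencies(genotypes):
    """Frequency of the counted allele at each marker (0/1/2 coding, None = missing)."""
    n_markers = len(genotypes[0])
    freqs = []
    for j in range(n_markers):
        calls = [row[j] for row in genotypes if row[j] is not None]
        # Each diploid individual carries two alleles per marker.
        freqs.append(sum(calls) / (2 * len(calls)))
    return freqs


def minor_allele_frequencies(genotypes):
    """Minor allele frequency: the smaller of p and 1 - p at each marker."""
    return [min(p, 1.0 - p) for p in allele_frequencies(genotypes)]


def vanraden_grm(genotypes):
    """Genomic relationship matrix G = ZZ' / (2 * sum p(1 - p)), Z = M - 2p.

    Assumes at least one polymorphic marker; otherwise the scale is zero.
    """
    p = allele_frequencies(genotypes)
    # Centre each marker by 2p; a missing call is set to the marker mean,
    # which makes its centred value 0 (simple mean imputation).
    centred = [
        [0.0 if g is None else g - 2.0 * pj for g, pj in zip(row, p)]
        for row in genotypes
    ]
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    n = len(centred)
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale for k in range(n)]
        for i in range(n)
    ]
```

In practice, markers with a very low minor allele frequency or many missing calls are usually filtered out before the matrix is computed.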

On the use of whole-genome sequence data for across-breed genomic prediction and fine-scale mapping of QTL

Genetics Selection Evolution ◽

10.1186/s12711-021-00607-4 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Theo Meuwissen ◽

Irene van den Berg ◽

Mike Goddard

Keyword(s):

Variable Selection ◽

Genome Sequence ◽

Genomic Prediction ◽

Milk Fat ◽

Genotype Imputation ◽

Whole Genome Sequence ◽

Genomic Relationship Matrix ◽

Polygenic Effect ◽

Relationship Matrix ◽

Whole Genome

Abstract Background: Whole-genome sequence (WGS) data are increasingly available on large numbers of individuals in animal and plant breeding and in human genetics through second-generation resequencing technologies, 1000 genomes projects, and large-scale genotype imputation from lower marker densities. Here, we present a computationally fast implementation of a variable selection genomic prediction method that could handle WGS data on more than 35,000 individuals, test its accuracy for across-breed predictions and assess its quantitative trait locus (QTL) mapping precision. Methods: The Monte Carlo Markov chain (MCMC) variable selection model (Bayes GC) fits simultaneously a genomic best linear unbiased prediction (GBLUP) term, i.e. a polygenic effect whose correlations are described by a genomic relationship matrix (G), and a Bayes C term, i.e. a set of single nucleotide polymorphisms (SNPs) with large effects selected by the model. Computational speed is improved by a Metropolis–Hastings sampling that directs computations to the SNPs, which are, a priori, most likely to be included into the model. Speed is also improved by running many relatively short MCMC chains. Memory requirements are reduced by storing the genotype matrix in binary form. The model was tested on a WGS dataset containing Holstein, Jersey and Australian Red cattle. The data contained 4,809,520 genotypes on 35,549 individuals together with their milk, fat and protein yields, and fat and protein percentage traits. Results: The prediction accuracies of the Jersey individuals improved by 1.5% when using across-breed GBLUP compared to within-breed predictions. Using WGS instead of 600 k SNP-chip data yielded on average a 3% accuracy improvement for Australian Red cows. QTL were fine-mapped by locating the SNP with the highest posterior probability of being included in the model. Various QTL known from the literature were rediscovered, and a new SNP affecting milk production was discovered on chromosome 20 at 34.501126 Mb. Due to the high mapping precision, it was clear that many of the discovered QTL were the same across the five dairy traits. Conclusions: Across-breed Bayes GC genomic prediction improved prediction accuracies compared to GBLUP. The combination of across-breed WGS data and Bayesian genomic prediction proved remarkably effective for the fine-mapping of QTL.

The use of MapPop1.0 for choosing a QTL mapping sample from an advanced backcross population

Theoretical and Applied Genetics ◽

10.1007/s00122-006-0495-8 ◽

2007 ◽

Vol 114 (6) ◽

pp. 1019-1028 ◽

Cited By ~ 4

Author(s):

C. Birolleau-Touchard ◽

E. Hanocq ◽

A. Bouchez ◽

C. Bauland ◽

I. Dourlen ◽

...

Keyword(s):

Qtl Mapping ◽

Backcross Population ◽

Advanced Backcross ◽

Advanced Backcross Population

330 A hybrid model for genomic selection using prioritized SNPs based on FST scores in the presence of non-genotyped animals

Journal of Animal Science ◽

10.1093/jas/skz258.102 ◽

2019 ◽

Vol 97 (Supplement_3) ◽

pp. 51-51

Author(s):

Sajjad Toghiani ◽

Ling-Yun Chang ◽

El H Hay ◽

Andrew J Roberts ◽

Samuel E Aggrey ◽

...

Keyword(s):

Genomic Selection ◽

Hybrid Approach ◽

Computational Cost ◽

Simulated Data ◽

Snp Markers ◽

Genomic Relationship Matrix ◽

Polygenic Effect ◽

Relationship Matrix ◽

Continuous Increase ◽

Missing Genotypes

Abstract The dramatic advancement in genotyping technology has greatly reduced the complexity and cost of genotyping. The continuous increase in the density of marker panels is resulting in little to no improvement in the accuracy of genomic selection. Direct inversion of the genomic relationship matrix is infeasible for some livestock populations due to the excessive computational cost. In addition, most animals in genetic evaluation programs are non-genotyped. Including these animals in a genomic evaluation requires the imputation of the missing genotypes when using regression methods. To overcome these challenges, a hybrid approach is proposed. This approach fits a subset of SNP markers selected based on FST scores and a classical polygenic effect. The method was first tested using only genotyped animals and then extended to accommodate non-genotyped animals. The proposed approach was evaluated using simulated data for a trait with heritability of 0.1 and 0.4 and weaning weight in a crossbred beef cattle population. When all animals were genotyped, the hybrid approach using only 2.5% of prioritized SNPs exceeded the prediction accuracies of BayesB, BayesC, and GBLUP by more than 7%. When non-genotyped animals were incorporated, the proposed approach significantly outperformed ss-GBLUP method in terms of prediction accuracy under both simulated heritability scenarios. Although the results seem to depend on the genetic complexity of the trait, the proposed approach resulted in higher prediction accuracies than current methods. Furthermore, its computational costs in terms of CPU time and peak memory are substantially lower than the current methods.

Estimating SNP heritability in presence of population substructure in biobank-scale datasets

10.1101/2020.08.05.236901 ◽

2020 ◽

Author(s):

Zhaotong Lin ◽

Souvik Seal ◽

Saonli Basu

Keyword(s):

Complex Traits ◽

Population Stratification ◽

Mixed Model ◽

Linear Mixed Model ◽

Population Substructure ◽

Relationship Matrix ◽

Phenotypic Variance ◽

Genetic Contribution ◽

Heritability Estimation ◽

The Impact

Abstract SNP heritability of a trait is measured by the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms (SNPs). Linear mixed models are routinely used to estimate SNP heritability for many complex traits. The basic concept behind this approach is to model the genetic contribution as a random effect, where the variance of this genetic contribution determines the heritability of the trait. This linear mixed model approach requires estimation of ‘relatedness’ among individuals in the sample, which is usually captured by estimating a genetic relationship matrix (GRM). Heritability is estimated by the restricted maximum likelihood (REML) or method of moments (MOM) approaches, and this estimation relies heavily on the GRM computed from the genetic data on individuals. Presence of population substructure in the data could significantly impact the GRM estimation and may introduce bias in heritability estimation. The common practice for accounting for such population substructure is to adjust for the top few principal components of the GRM as covariates in the linear mixed model. Here we propose an alternative way of estimating heritability in multi-ethnic studies. Our proposed approach is a MOM estimator derived from the Haseman-Elston regression and gives an asymptotically unbiased estimate of heritability in the presence of population stratification. It introduces adjustments for the population stratification in a second-order estimating equation and allows the total phenotypic variance to vary by ethnicity. We study the performance of different MOM and REML approaches in the presence of population stratification through extensive simulation studies. We estimate the heritability of height, weight and other anthropometric traits in the UK Biobank cohort to investigate the impact of subtle population substructure on SNP heritability estimation.
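
As background for the method-of-moments idea, the sketch below shows the basic, unadjusted Haseman-Elston regression: products of standardized phenotypes for each pair of individuals are regressed through the origin on the corresponding off-diagonal GRM entries, and the slope estimates SNP heritability. It does not implement the authors' stratification-adjusted estimator. The function name and the choice of population standard deviation are assumptions made for illustration.

```python
import statistics


def haseman_elston_h2(phenotypes, grm):
    """Basic Haseman-Elston (MOM) estimate of SNP heritability.

    phenotypes: list of trait values; grm: square list-of-lists GRM.
    Returns sum_{i<j} G_ij z_i z_j / sum_{i<j} G_ij^2, where z is the
    standardized phenotype. No covariates or stratification adjustment.
    """
    mean = statistics.fmean(phenotypes)
    sd = statistics.pstdev(phenotypes)  # population SD (an assumption here)
    z = [(y - mean) / sd for y in phenotypes]
    numerator = 0.0
    denominator = 0.0
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):  # off-diagonal pairs only
            numerator += grm[i][j] * z[i] * z[j]
            denominator += grm[i][j] ** 2
    return numerator / denominator
```

The double loop visits every pair of individuals, so this plain form is only practical for small samples; biobank-scale implementations avoid forming all pairs explicitly.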

Permutation Testing in the Presence of Polygenic Variation

10.1101/014571 ◽

2015 ◽

Author(s):

Mark Abney

Keyword(s):

Quantitative Trait ◽

Mixed Model ◽

Linear Mixed Model ◽

Permutation Test ◽

Statistical Significance ◽

Null Distribution ◽

Polygenic Effect ◽

Permutation Testing ◽

Test Statistic ◽

Omnibus Test

This article discusses problems with and solutions to performing valid permutation tests for quantitative trait loci in the presence of polygenic effects. Although permutation testing is a popular approach for determining statistical significance of a test statistic with an unknown distribution--for instance, the maximum of multiple correlated statistics or some omnibus test statistic for a gene, gene-set or pathway--naive application of permutations may result in an invalid test. The risk of performing an invalid permutation test is particularly acute in complex trait mapping where polygenicity may combine with a structured population resulting from the presence of families, cryptic relatedness, admixture or population stratification. I give both analytical derivations and a conceptual understanding of why typical permutation procedures fail and suggest an alternative permutation based algorithm, MVNpermute, that succeeds. In particular, I examine the case where a linear mixed model is used to analyze a quantitative trait and show that both phenotype and genotype permutations may result in an invalid permutation test. I provide a formula that predicts the amount of inflation of the type 1 error rate depending on the degree of misspecification of the covariance structure of the polygenic effect and the heritability of the trait. I validate this formula by doing simulations, showing that the permutation distribution matches the theoretical expectation, and that my suggested permutation based test obtains the correct null distribution. Finally, I discuss situations where naive permutations of the phenotype or genotype are valid and the applicability of the results to other test statistics.

Estimation of Genetic Variance Contributed by a Quantitative Trait Locus — Correcting the Bias Associated with Significance Tests

Genetics ◽

10.1093/genetics/iyab115 ◽

2021 ◽

Author(s):

Fangjie Xie ◽

Shibo Wang ◽

William D Beavis ◽

Shizhong Xu

Keyword(s):

Qtl Mapping ◽

Effect Size ◽

Bias Correction ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Chromosome 9 ◽

Genome Wide Association Studies ◽

Test Statistic ◽

Chi Square

Abstract The Beavis effect in QTL mapping describes the phenomenon that the estimated effect size of a statistically significant QTL (measured by the QTL variance) is greater than the true effect size of the QTL if the sample size is not sufficiently large. This is a typical example of the winner’s curse applied to molecular quantitative genetics. Theoretical evaluation of and correction for the winner’s curse have been studied for interval mapping. However, similar techniques have not been available for current models of QTL mapping and genome-wide association studies, where a polygene is often included in the linear mixed model to control the genetic background effect. In this study, we developed the theory of the Beavis effect in a linear mixed model using a truncated non-central Chi-square distribution. We equated the observed Wald test statistic of a significant QTL to the expectation of a truncated non-central Chi-square distribution to obtain a bias-corrected estimate of the QTL variance. The results are validated by replicated Monte Carlo simulation experiments. We applied the new method to the grain width (GW) trait of a rice population consisting of 524 homozygous varieties with over 300k single nucleotide polymorphism (SNP) markers. Two loci were identified and the estimated QTL heritabilities were corrected for the Beavis effect. Bias correction for the larger QTL on chromosome 5 (GW5), with an estimated heritability of 12%, did not change the QTL heritability because of the extremely large test score and estimated QTL effect. The smaller QTL on chromosome 9 (GW9) had an estimated QTL heritability of 9%, which was reduced to 6% after the bias correction.
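
The core step described above, equating an observed test statistic to the mean of a truncated non-central Chi-square distribution, can be illustrated with a simplified sketch. It assumes a 1-degree-of-freedom statistic W = (Z + sqrt(lam))^2 that is reported only when it exceeds a significance threshold c, and it solves for the non-centrality parameter lam by bisection. How the paper maps this parameter to a QTL variance in the mixed model is not reproduced here. All function names are hypothetical.

```python
import math


def _phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _upper_tail(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def truncated_mean(lam, c):
    """E[W | W > c] for W = (Z + sqrt(lam))^2 (non-central Chi-square, 1 df)."""
    d = math.sqrt(lam)
    t = math.sqrt(c)
    a = t - d   # W > c  <=>  Z > a  or  Z < b
    b = -t - d
    # E[(Z + d)^2 ; Z > a], expanded via partial moments of the normal.
    upper = a * _phi(a) + _upper_tail(a) + 2.0 * d * _phi(a) + lam * _upper_tail(a)
    # E[(Z + d)^2 ; Z < b], using P(Z < b) = P(Z > -b).
    lower = -b * _phi(b) + _upper_tail(-b) - 2.0 * d * _phi(b) + lam * _upper_tail(-b)
    prob = _upper_tail(a) + _upper_tail(-b)
    return (upper + lower) / prob


def bias_corrected_noncentrality(w_obs, c, tol=1e-10):
    """Solve truncated_mean(lam, c) = w_obs for lam >= 0 by bisection.

    Assumes the truncated mean increases with lam. Returns 0.0 when even
    lam = 0 already gives a truncated mean at or above the observed value.
    """
    if truncated_mean(0.0, c) >= w_obs:
        return 0.0
    lo, hi = 0.0, w_obs  # at lam = w_obs the mean is at least 1 + w_obs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, c) < w_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Without truncation the mean of W is 1 + lam, so the naive estimate is W - 1. The correction shrinks this estimate noticeably when W is close to the threshold and hardly at all when W is far above it, in line with the contrast between GW9 and GW5 reported in the abstract.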

Genomic Heritability: A Ragged Diagonal Between Bias and Variance

10.1101/2021.09.19.460999 ◽

2021 ◽

Author(s):

Mitchell J. Feldmann ◽

Hans-Peter Piepho ◽

Steven J. Knapp

Keyword(s):

Mixed Model ◽

Dna Polymorphisms ◽

Breeding Value ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Genomic Relationship ◽

Model Framework ◽

Kinship Matrix ◽

Genomic Heritability ◽

A Genome

Many important traits in plants, animals, and microbes are polygenic and are therefore difficult to improve through traditional marker-assisted selection. Genomic prediction addresses this by enabling the inclusion of all genetic data in a mixed model framework. The main method for predicting breeding values is genomic best linear unbiased prediction (GBLUP), which uses the realized genomic relationship or kinship matrix (K) to connect genotype to phenotype. The use of relationship matrices allows information to be shared for estimating the genetic values for observed entries and predicting genetic values for unobserved entries. One of the key parameters of such models is genomic heritability (h2g), or the variance of a trait associated with a genome-wide sample of DNA polymorphisms. Here we discuss the relationship between several common methods for calculating the genomic relationship matrix and propose a new matrix based on the average semivariance that yields accurate estimates of genomic variance in the observed population regardless of the focal population quality as well as accurate breeding value predictions in unobserved samples. Notably, our proposed method is highly similar to the approach presented by Legarra (2016) despite different mathematical derivations and statistical perspectives and only deviates from the classic approach presented in VanRaden (2008) by a scaling factor. With current approaches, we found that the genomic heritability tends to be either over- or underestimated depending on the scaling and centering applied to the marker matrix (Z), the value of the average diagonal element of K, and the assortment of alleles and heterozygosity (H) in the observed population and that, unlike its predecessors, our newly proposed kinship matrix KASV yields accurate estimates of h2g in the observed population, generalizes to larger populations, and produces BLUPs equivalent to common methods in plants and animals.

Purebred-crossbred genetic parameters for reproductive traits in swine

Journal of Animal Science ◽

10.1093/jas/skab270 ◽

2021 ◽

Author(s):

Luke M Kramer ◽

Ania Wolc ◽

Hadi Esfandyari ◽

Dinesh M Thekkoot ◽

Chunyan Zhang ◽

...

Keyword(s):

Genetic Parameters ◽

Additive Genetic Variance ◽

Genetic Correlations ◽

Reproductive Traits ◽

Allele Frequencies ◽

Superior Performance ◽

Joint Analysis ◽

Genomic Relationship Matrix ◽

Parameter Estimates ◽

Relationship Matrix

Abstract For swine breeding programs, testing and selection programs are usually within purebred (PB) populations located in nucleus units that are generally managed differently and tend to have a higher health level than the commercial herds in which the crossbred (CB) descendants of these nucleus animals are expected to perform. This approach assumes that PB animals selected in the nucleus herd will have CB progeny that have superior performance at the commercial level. There is clear evidence that this may not be the case for all traits of economic importance and, thus, including data collected at the commercial herd level may increase the accuracy of selection for commercial CB performance at the nucleus level. The goal for this study was to estimate genetic parameters for five maternal reproductive traits between two PB maternal nucleus populations (Landrace and Yorkshire) and their CB offspring: Total Number Born (TNB), Number Born Alive (NBA), Number Born Alive > 1 kg (NBA>1kg), Total Number Weaned (TNW), and Litter Weight at Weaning (LWW). Estimates were based on single-step GBLUP by analyzing any two combinations of a PB and the CB population, and by analyzing all three populations jointly. The genomic relationship matrix between the three populations was generated by using within population allele frequencies for relationships within a population, and across population allele frequencies for relationships of the CB with the PB animals. Utilization of metafounders for the two PB populations had no effect on parameter estimates, so the two PB populations were assumed to be genetically unrelated. Joint analysis of two (one PB plus CB) versus three (both PB and CB) populations did not impact estimates of heritability, additive genetic variance, and genetic correlations. Heritabilities were generally similar between the PB and CB populations, except for LWW and TNW, for which PB populations had about four times larger estimates than CB. 
Purebred-crossbred genetic correlations were larger for Landrace than for Yorkshire, except for NBA>1kg. These estimates indicate that there is potential to improve selection of PB animals for CB performance by including CB information for all traits in the Yorkshire population, but that noticeable additional gains may only occur for NBA>1kg and TNW in the Landrace population.

Pooled genotyping strategies for the rapid construction of genomic reference populations

Journal of Animal Science ◽

10.1093/jas/skz344 ◽

2019 ◽

Vol 97 (12) ◽

pp. 4761-4769 ◽

Cited By ~ 2

Author(s):

Pâmela A Alexandre ◽

Laercio R Porto-Neto ◽

Emre Karaman ◽

Sigrid A Lehnert ◽

Antonio Reverter

Keyword(s):

Mixed Model ◽

Cost Savings ◽

Cost Effective ◽

Pedigree Information ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Phenotypic Data ◽

Feasible Alternative ◽

Cattle Herds ◽

Estimated Breeding Values

Abstract The growing concern with the environment is making it important for livestock producers to focus on selection for efficiency-related traits, which is a challenge for commercial cattle herds due to the lack of pedigree information. To explore a cost-effective opportunity for genomic evaluations of commercial herds, this study compared the accuracy of bulls’ genomic estimated breeding values (GEBV) using different pooled genotype strategies. We used ten replicates of previously simulated genomic and phenotypic data for one low (t1) and one moderate (t2) heritability trait of 200 sires and 2,200 progeny. Sires’ GEBV were calculated using a univariate mixed model, with a hybrid genomic relationship matrix (h-GRM) relating sires to: 1) 1,100 pools of 2 animals; 2) 440 pools of 5 animals; 3) 220 pools of 10 animals; 4) 110 pools of 20 animals; 5) 88 pools of 25 animals; 6) 44 pools of 50 animals; and 7) 22 pools of 100 animals. Pooling criteria were: at random, grouped sorting by t1, grouped sorting by t2, and grouped sorting by a combination of t1 and t2. The same criteria were used to select 110, 220, 440, and 1,100 individual genotypes for GEBV calculation to compare GEBV accuracy using the same number of individual genotypes and pools. Although the best accuracy was achieved for a given trait when pools were grouped based on that same trait (t1: 0.50–0.56, t2: 0.66–0.77), pooling by one trait had a negative impact on the accuracy of GEBV for the other trait (t1: 0.25–0.46, t2: 0.29–0.71). Therefore, the combined measure may be a feasible alternative that allows the same pools to be used to calculate GEBVs for both traits (t1: 0.45–0.57, t2: 0.62–0.76). Pools of 10 individuals were identified as representing a good compromise between loss of accuracy (~10%–15%) and cost savings (~90%) from genotype assays. 
In addition, we demonstrated that in more than 90% of the simulations, pools present higher sires’ GEBV accuracy than individual genotypes when the number of genotype assays is limited (i.e., 110 or 220) and animals are assigned to pools based on phenotype. Pools assigned at random presented the poorest results (t1: 0.07–0.45, t2: 0.14–0.70). In conclusion, pooling by phenotype is the best approach to implementing genomic evaluation using commercial herd data, particularly when pools of 10 individuals are evaluated. While combining phenotypes seems a promising strategy to allow more flexibility to the estimates made using pools, more studies are necessary in this regard.
