Evaluation of the accuracy of imputed sequence variants and their utility for causal variant detection in cattle

2016
Author(s):
Hubert Pausch
Iona M MacLeod
Ruedi Fries
Reiner Emmerling
Phil J Bowman
...  

Abstract. Background: The availability of dense genotypes and whole-genome sequence variants from various sources offers the opportunity to compile large data sets consisting of tens of thousands of individuals with genotypes at millions of polymorphic sites that may enhance the power of genomic analyses. The imputation of missing genotypes ensures that all individuals have genotypes for a shared set of variants. Results: We evaluated the accuracy of imputation from dense genotypes to whole-genome sequence variants in 249 Fleckvieh and 450 Holstein cattle using Minimac and FImpute. The sequence variants of a subset of the animals were reduced to the variants that were included in the Illumina BovineHD genotyping array and subsequently inferred in silico using either within- or multi-breed reference populations. The accuracy of imputation varied considerably across chromosomes and dropped in regions where the bovine genome contains segmental duplications. Depending on the imputation strategy, the correlation between imputed and true genotypes ranged from 0.898 to 0.952. The accuracy of imputation was higher with Minimac than with FImpute, particularly for variants with low minor allele frequency (MAF). Considering a multi-breed reference population increased the accuracy of imputation, particularly when FImpute was used to infer genotypes. When the sequence variants were imputed using Minimac, the true genotypes were more strongly correlated with predicted allele dosages than with best-guess genotypes. The computing costs to impute 23,256,743 sequence variants in 6958 animals were ten-fold higher with Minimac than with FImpute. Association studies with imputed sequence variants revealed seven quantitative trait loci (QTL) for milk fat percentage. Two causal mutations in the DGAT1 and GHR genes were the most significantly associated variants at two QTL on chromosomes 14 and 20 when Minimac was used to infer genotypes. Conclusions: The population-based imputation of millions of sequence variants in large cohorts is computationally feasible and provides accurate genotypes. However, the accuracy of imputation is low in regions where the genome contains large segmental duplications or the coverage with array-derived SNPs is poor. Using a reference population that includes individuals from many breeds increases the accuracy of imputation, particularly at low-frequency variants. Considering allele dosages rather than best-guess genotypes as explanatory variables is advantageous for detecting causal mutations in association studies with imputed sequence variants.
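
The distinction between allele dosages and best-guess genotypes drawn in this abstract can be illustrated with a minimal sketch. It assumes that genotypes are coded as alternate-allele counts (0, 1, 2), that the imputation tool reports posterior genotype probabilities, and that accuracy is measured as the Pearson correlation with the true genotypes. The function names and the toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np

def dosage_and_best_guess(probs):
    """Derive both summaries from posterior genotype probabilities.

    probs: (n_animals, 3) array with P(genotype = 0), P(1), P(2).
    """
    probs = np.asarray(probs, dtype=float)
    dosage = probs @ np.array([0.0, 1.0, 2.0])  # expected alternate-allele count
    best_guess = probs.argmax(axis=1)           # most probable genotype
    return dosage, best_guess

def imputation_accuracy(true_genotypes, imputed):
    """Pearson correlation between true and imputed values."""
    return float(np.corrcoef(true_genotypes, imputed)[0, 1])

# Toy data (invented for illustration): six animals at one variant.
true_gt = np.array([0, 1, 2, 1, 0, 2])
probs = np.array([
    [0.90, 0.10, 0.00],
    [0.40, 0.50, 0.10],
    [0.00, 0.30, 0.70],
    [0.50, 0.45, 0.05],  # uncertain call: best guess is 0, truth is 1
    [0.80, 0.20, 0.00],
    [0.05, 0.25, 0.70],
])

dosage, best_guess = dosage_and_best_guess(probs)
r_dosage = imputation_accuracy(true_gt, dosage)
r_best_guess = imputation_accuracy(true_gt, best_guess)
print(f"dosage r = {r_dosage:.3f}, best-guess r = {r_best_guess:.3f}")
```

In this constructed example the dosage retains the uncertainty of the fourth animal's call, whereas the best-guess genotype discards it, which is why the dosage correlates more strongly with the truth here. Real data need not behave the same way at every variant.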

2015
Author(s):  
Hubert Pausch ◽  
Reiner Emmerling ◽  
Hermann Schwarzenbacher ◽  
Ruedi Fries

Background: The availability of whole-genome sequence data from key ancestors provides an exhaustive catalogue of polymorphic sites segregating within and across cattle breeds. Sequence variants from key ancestors can be imputed in animals that have been genotyped using medium- and high-density genotyping arrays. Association analysis with imputed sequences, particularly if applied to multiple traits simultaneously, is a very powerful approach to revealing candidate causal variants underlying complex phenotypes. Results: We used whole-genome sequence data from 157 key ancestors of the German Fleckvieh population to impute 20 561 798 sequence variants in 10 363 animals that had (partly imputed) array-derived genotypes at 634 109 SNPs. The imputed sequence data were enriched for rare variants. Association studies with imputed sequence variants were performed using seven correlated udder conformation traits as response variables. The calculation of an approximate multi-trait test statistic enabled us to detect twelve major QTL (P < 2.97 × 10⁻⁹) controlling different aspects of mammary gland morphology. Imputed sequence variants were the most significantly associated at eleven QTL, whereas the top association signal at a QTL on BTA14 resulted from an array-derived variant. Seven QTL were associated with multiple phenotypes. Most QTL were located in non-coding regions of the genome, although close to plausible candidate genes for mammary gland morphology (SP5, GC, NPFFR2, CRIM1, RXFP2, TBX5, RBM19, ADAM12). Conclusions: Association analysis with imputed sequence variants allows QTL characterization at maximum resolution. Multi-trait approaches can reveal QTL that are not detected in single-trait association studies. Most QTL for udder conformation traits were located in non-coding elements of the genome, suggesting that regulatory mutations are the major determinants of variation in mammary gland morphology in cattle.
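
The abstract does not spell out how the approximate multi-trait test statistic was calculated. One common formulation, shown here purely as an assumption, combines the signed single-trait t-values of each variant into a quadratic form t' V⁻¹ t, where V is the correlation matrix among traits estimated from the t-values across all variants. Under the null hypothesis, this quantity is approximately chi-square distributed with as many degrees of freedom as there are traits. The function name and the simulated input are hypothetical.

```python
import numpy as np

def multi_trait_statistic(t_values):
    """Approximate multi-trait statistic per variant (assumed formulation).

    t_values: (n_variants, n_traits) signed t statistics from
              single-trait association tests.
    Returns an array of length n_variants holding t_i' V^-1 t_i.
    """
    t = np.asarray(t_values, dtype=float)
    # Trait-by-trait correlation matrix estimated across variants.
    V = np.corrcoef(t, rowvar=False)
    V_inv = np.linalg.inv(V)
    # Quadratic form evaluated row by row.
    return np.einsum("ij,jk,ik->i", t, V_inv, t)

# Simulated null t-values for 500 variants and 3 traits (illustration only).
rng = np.random.default_rng(0)
t_sim = rng.standard_normal((500, 3))
stats = multi_trait_statistic(t_sim)
```

A variant with moderate but consistent signals across several correlated traits can reach a large combined statistic even when no single-trait test is significant, which matches the abstract's remark that multi-trait approaches reveal QTL missed by single-trait analyses.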


2021
Vol 53 (1)
Author(s):  
Theo Meuwissen ◽  
Irene van den Berg ◽  
Mike Goddard

Abstract. Background: Whole-genome sequence (WGS) data are increasingly available on large numbers of individuals in animal and plant breeding and in human genetics through second-generation resequencing technologies, 1000 genomes projects, and large-scale genotype imputation from lower marker densities. Here, we present a computationally fast implementation of a variable selection genomic prediction method that can handle WGS data on more than 35,000 individuals, test its accuracy for across-breed predictions, and assess its quantitative trait locus (QTL) mapping precision. Methods: The Monte Carlo Markov chain (MCMC) variable selection model (Bayes GC) simultaneously fits a genomic best linear unbiased prediction (GBLUP) term, i.e. a polygenic effect whose correlations are described by a genomic relationship matrix (G), and a Bayes C term, i.e. a set of single nucleotide polymorphisms (SNPs) with large effects selected by the model. Computational speed is improved by a Metropolis–Hastings sampling that directs computations to the SNPs that are, a priori, most likely to be included in the model. Speed is also improved by running many relatively short MCMC chains. Memory requirements are reduced by storing the genotype matrix in binary form. The model was tested on a WGS dataset containing Holstein, Jersey and Australian Red cattle. The data contained 4,809,520 genotypes on 35,549 individuals together with their milk, fat and protein yields, and fat and protein percentage traits. Results: The prediction accuracies of the Jersey individuals improved by 1.5% when using across-breed GBLUP compared to within-breed predictions. Using WGS instead of 600 k SNP-chip data yielded on average a 3% accuracy improvement for Australian Red cows. QTL were fine-mapped by locating the SNP with the highest posterior probability of being included in the model. Various QTL known from the literature were rediscovered, and a new SNP affecting milk production was discovered on chromosome 20 at 34.501126 Mb. Due to the high mapping precision, it was clear that many of the discovered QTL were the same across the five dairy traits. Conclusions: Across-breed Bayes GC genomic prediction improved prediction accuracies compared to GBLUP. The combination of across-breed WGS data and Bayesian genomic prediction proved remarkably effective for the fine-mapping of QTL.
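
The abstract states that memory requirements are reduced by storing the genotype matrix in binary form, but it does not describe the encoding. A minimal sketch of one possible scheme follows, assuming genotype codes 0, 1 and 2 plus a hypothetical code 3 for missing values, each stored in 2 bits so that four genotypes fit in one byte. The function names are illustrative and do not come from the Bayes GC software.

```python
import numpy as np

def pack_genotypes(genotypes):
    """Pack genotype codes (0, 1, 2, or 3 = missing) into 2 bits each.

    Returns the packed uint8 array and the original number of genotypes.
    """
    g = np.asarray(genotypes, dtype=np.uint8).ravel()
    if g.size and g.max() > 3:
        raise ValueError("genotype codes must be in the range 0..3")
    n = g.size
    pad = (-n) % 4  # pad with zeros up to a multiple of four
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    packed = g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)
    return packed.astype(np.uint8), n

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from a packed array."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    g = (packed[:, None] >> shifts) & 3
    return g.reshape(-1)[:n]

codes = np.array([0, 1, 2, 1, 0, 2, 3], dtype=np.uint8)
packed, n = pack_genotypes(codes)
restored = unpack_genotypes(packed, n)
```

Compared with one byte per genotype, this layout needs a quarter of the memory, at the cost of a shift-and-mask step whenever genotypes are read.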


Genes
2021
Vol 12 (11)
pp. 1830
Author(s):  
Victor B. Pedrosa ◽  
Flavio S. Schenkel ◽  
Shi-Yi Chen ◽  
Hinayah R. Oliveira ◽  
Theresa M. Casey ◽  
...  

Lactation persistency and milk production are among the most economically important traits in the dairy industry. In this study, we explored the association of over 6.1 million imputed whole-genome sequence variants with lactation persistency (LP), milk yield (MILK), fat yield (FAT), fat percentage (FAT%), protein yield (PROT), and protein percentage (PROT%) in North American Holstein cattle. We identified 49, 3991, 2607, 4459, 805, and 5519 SNPs significantly associated with LP, MILK, FAT, FAT%, PROT, and PROT%, respectively. Various known associations were confirmed while several novel candidate genes were also revealed, including ARHGAP35, NPAS1, TMEM160, ZC3H4, SAE1, ZMIZ1, PPIF, LDB2, ABI3, SERPINB6, and SERPINB9 for LP; NIM1K, ZNF131, GABRG1, GABRA2, DCHS1, and SPIDR for MILK; NR6A1, OLFML2A, EXT2, POLD1, GOT1, and ETV6 for FAT; DPP6, LRRC26, and the KCN gene family for FAT%; CDC14A, RTCA, HSTN, and ODAM for PROT; and HERC3, HERC5, LALBA, CCL28, and NEURL1 for PROT%. Most of these genes are involved in relevant gene ontology (GO) terms such as fatty acid homeostasis, transporter regulator activity, response to progesterone and estradiol, response to steroid hormones, and lactation. The significant genomic regions found contribute to a better understanding of the molecular mechanisms related to LP and milk production in North American Holstein cattle.


2018
Vol 50 (1)
Author(s):  
Chunyan Zhang ◽  
Robert Alan Kemp ◽  
Paul Stothard ◽  
Zhiquan Wang ◽  
Nicholas Boddicker ◽  
...  

2018
Vol 50 (5)
pp. 727-736
Author(s):  
Donna M. Werling ◽  
Harrison Brand ◽  
Joon-Yong An ◽  
Matthew R. Stone ◽  
Lingxue Zhu ◽  
...  

2019
Vol 51 (1)
Author(s):  
Nasir Moghaddar ◽  
Majid Khansefid ◽  
Julius H. J. van der Werf ◽  
Sunduimijid Bolormaa ◽  
Naomi Duijvesteijn ◽  
...  

Abstract. Background: Whole-genome sequence (WGS) data could contain information on genetic variants at or in high linkage disequilibrium with causative mutations that underlie the genetic variation of polygenic traits. Thus far, genomic prediction accuracy has shown limited increase when using such information in dairy cattle studies, in which one or few breeds with limited diversity predominate. The objective of our study was to evaluate the accuracy of genomic prediction in a multi-breed Australian sheep population of relatively less related target individuals, when using information on imputed WGS genotypes. Methods: Between 9626 and 26,657 animals with phenotypes were available for nine economically important sheep production traits and all had WGS imputed genotypes. About 30% of the data were used to discover predictive single nucleotide polymorphisms (SNPs) based on a genome-wide association study (GWAS) and the remaining data were used for training and validation of genomic prediction. Prediction accuracy using selected variants from imputed sequence data was compared to that using a standard array of 50k SNP genotypes, with both genomic best linear unbiased prediction (GBLUP) and Bayesian methods (BayesR/BayesRC). Accuracy of genomic prediction was evaluated in two independent populations that were each lowly related to the training set, one being purebred Merino and the other crossbred Border Leicester x Merino sheep. Results: A substantial improvement in prediction accuracy was observed when selected sequence variants were fitted alongside 50k genotypes as a separate variance component in GBLUP (2GBLUP) or in Bayesian analysis as a separate category of SNPs (BayesRC). From an average accuracy of 0.27 in both validation sets for the 50k array, the average absolute increase in accuracy across traits with 2GBLUP was 0.083 and 0.073 for purebred and crossbred animals, respectively, whereas with BayesRC it was 0.102 and 0.087. The average gain in accuracy was smaller when selected sequence variants were treated in the same category as 50k SNPs. Very little improvement over 50k prediction was observed when using all WGS variants. Conclusions: Accuracy of genomic prediction in diverse sheep populations increased substantially by using variants selected from whole-genome sequence data based on an independent multi-breed GWAS, when compared to genomic prediction using standard 50k genotypes.


2017
Vol 100 (8)
pp. 6356-6370
Author(s):  
Xiaoping Wu ◽  
Bernt Guldbrandtsen ◽  
Ulrik Sander Nielsen ◽  
Mogens Sandø Lund ◽  
Goutam Sahana

2019
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract. Variant discovery in personal, whole-genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

