Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship

2017 ◽  
Author(s):  
S. Hong Lee ◽  
Sam Clark ◽  
Julius H.J. van der Werf

Abstract Genomic prediction is emerging in a wide range of fields including animal and plant breeding, risk prediction in human precision medicine, and forensics. It is desirable to establish a theoretical framework for genomic prediction accuracy when the reference data consist of information sources with varying degrees of relationship to the target individuals. A reference set can contain both close and distant relatives as well as ‘unrelated’ individuals from the wider population. The various sources of information were modeled as different populations with different effective population sizes (Ne). Both the effective number of chromosome segments (Me) and Ne are considered to be functions of the data used for prediction. We validate our theory with analyses of simulated as well as real data, and illustrate that the variation in genomic relationships with the target is a predictor of the information content of the reference set. With a similar amount of data available from each source, we show that close relatives can have a substantially larger effect on genomic prediction accuracy than less-related individuals. We also illustrate that when prediction relies on closer relatives, there is less improvement in prediction accuracy with an increase in training data or marker panel density. We release software that estimates the expected prediction accuracy and power when combining different reference sources with various degrees of relationship to the target, which is useful when planning genomic prediction (before or after collecting data) in animal, plant and human genetics.
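The trade-off the abstract describes between reference size and the effective number of chromosome segments (Me) is often summarised with a deterministic expectation of the form r = sqrt(N·h² / (N·h² + Me)). The sketch below uses that simplest form with illustrative Me values; it is not the authors' released software, whose exact expressions may differ:

```python
import math

def expected_accuracy(n_ref: int, h2: float, me: float) -> float:
    """Deterministic expectation of genomic prediction accuracy:
    r = sqrt(N*h2 / (N*h2 + Me))."""
    nh2 = n_ref * h2
    return math.sqrt(nh2 / (nh2 + me))

# A reference set of close relatives behaves like a population with a
# small Ne, hence a small Me, and yields higher accuracy for the same N.
# The Me values below are illustrative assumptions, not estimates.
acc_close = expected_accuracy(n_ref=3000, h2=0.5, me=1_000)
acc_distant = expected_accuracy(n_ref=3000, h2=0.5, me=50_000)
```

With these toy numbers, acc_close is about 0.77 while acc_distant is about 0.17, mirroring the qualitative claim that close relatives contribute far more information per record than 'unrelated' individuals.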


Author(s):  
Saheb Foroutaifar

Abstract The main objectives of this study were to compare the prediction accuracy of different Bayesian methods for traits with a wide range of genetic architectures, using both simulated and real data, and to assess the sensitivity of these methods to violations of their assumptions. For the simulation study, different scenarios were implemented based on two traits with low or high heritability and differing numbers of QTL and distributions of their effects. For the real data analysis, a German Holstein dataset for milk fat percentage, milk yield, and somatic cell score was used. The simulation results showed that, with the exception of Bayes R, the methods were sensitive to changes in the number of QTL and the distribution of QTL effects. A distribution of QTL effects similar to what a given Bayesian method assumes for estimating marker effects did not improve its prediction accuracy. The Bayes B method gave accuracy higher than or equal to that of the rest. The real data analysis showed that, as in the simulation scenarios with a large number of QTL, there was no difference between the accuracies of the different methods for any of the traits.



Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as the special case obtained when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten different environments) that the SSI can achieve significant gains in prediction accuracy (between 5 and 10%) relative to the G-BLUP.
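The λ = 0 special case mentioned above, G-BLUP, uses all training records for every prediction. A minimal sketch on simulated data follows; the toy dimensions, marker effects, and the shrinkage parameter `delta` are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: n training and m prediction individuals, p markers.
n, m, p = 100, 40, 200
X = rng.standard_normal((n + m, p))        # centred, scaled marker codes
G = X @ X.T / p                            # genomic relationship matrix
g = X @ (rng.standard_normal(p) * 0.3)     # true genetic values
y = g[:n] + rng.standard_normal(n)         # training phenotypes

# G-BLUP: solve (G_train + delta*I) w = y, predict targets via the
# cross-relationships. This is the fully dense (lambda = 0) limit of
# the sparse index: every training record supports every prediction.
delta = 1.0                                # assumed variance ratio
w = np.linalg.solve(G[:n, :n] + delta * np.eye(n), y)
g_hat = G[n:, :n] @ w                      # predicted genetic values

accuracy = np.corrcoef(g_hat, g[n:])[0, 1] # prediction accuracy
```

The SSI replaces the dense weight vector with a sparse, per-individual set of support points selected by the regularization parameter λ.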



2021 ◽  
Author(s):  
Evans K. Cheruiyot ◽  
Mekonnen Haile-Mariam ◽  
Benjamin G. Cocks ◽  
Iona M. MacLeod ◽  
Raphael Mrode ◽  
...  

Abstract Background Heat tolerance is a trait of economic importance in the context of warm climates and the effects of global warming on livestock production, reproduction, health, and well-being. It is desirable to improve the prediction accuracy for heat tolerance to help accelerate the genetic gain for this trait. This study investigated the improvement in prediction accuracy for heat tolerance when selected sets of sequence variants from a large genome-wide association study (GWAS) were incorporated into the standard 50k SNP panel used by the industry. Methods Over 40,000 dairy cattle (Holsteins, Jerseys, and crossbreds) with genotype and phenotype data were analysed. The phenotypes used to measure an individual’s heat tolerance were defined as the rate of milk production decline (slope traits for the yield of milk, fat, and protein) with a rising temperature-humidity index. We used Holstein and Jersey cows to select sequence variants linked to heat tolerance based on the GWAS. We then investigated the accuracy of prediction when sets of these pre-selected sequence variants were added to the 50k industry SNP array used routinely for genomic evaluations in Australia. We used a bull reference set to develop the genomic prediction equations and then validated them in an independent set of Holstein, Jersey, and crossbred cows. The genomic prediction analyses were performed using the BayesR and BayesRC methods. Results The accuracy of genomic prediction for heat tolerance improved by up to 7%, 5%, and 10% in Holstein, Jersey, and crossbred cows, respectively, when sets of selected sequence markers from Holsteins (i.e., a single-breed QTL discovery set) were added to the 50k industry SNP panel. Using pre-selected sequence variants identified from a combined set of Holstein and Jersey cows in a multi-breed QTL discovery, a set of 6,132 to 6,422 SNPs generally improved accuracy, especially in the Jersey validation set. Combining Holstein and Jersey bulls (multi-breed) in the reference set improved prediction accuracy compared to using Holstein bulls alone. Conclusions Informative sequence markers can be prioritised to improve the genetic prediction of heat tolerance in different breeds; in addition to providing biological insight, these variants have direct application in the development of customized SNP arrays or can be utilised via imputation into current SNP sets.



2019 ◽  
Author(s):  
Christos Palaiokostas ◽  
Tomas Vesely ◽  
Martin Kocour ◽  
Martin Prchal ◽  
Dagmar Pokorova ◽  
...  

Abstract Genomic selection (GS) is increasingly applied in breeding programmes of major aquaculture species, enabling improved prediction accuracy and genetic gain compared to pedigree-based approaches. Koi Herpesvirus disease (KHVD) is notifiable to the World Organisation for Animal Health and the European Union, and causes major economic losses to carp production. Genomic selection has the potential to breed carp with improved resistance to KHVD, thereby contributing to disease control. In the current study, Restriction-site Associated DNA sequencing (RAD-seq) was applied to a population of 1,425 common carp juveniles which had been challenged with Koi Herpesvirus, followed by sampling of survivors and mortalities. GS was tested on a wide range of scenarios by varying both SNP densities and the genetic relationships between training and validation sets. The accuracy of correctly identifying KHVD-resistant animals using genomic selection was between 8 and 18% higher than pedigree best linear unbiased prediction (pBLUP), depending on the tested scenario. Furthermore, only minor decreases in prediction accuracy were observed with decreased SNP density. However, the genetic relationship between the training and validation sets was a key factor in the efficacy of genomic prediction of KHVD resistance in carp, with substantially lower prediction accuracy when the training and validation sets did not contain close relatives.



Heredity ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
Jose Crossa ◽  
Paulino Pérez-Rodríguez ◽  
...  

Abstract Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulate, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data, or a subset of it, should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using the SSI with additive relationships (GSSI) led to 5–17% increases in accuracy relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.
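A Gaussian kernel of the kind used for KBLUP maps marker distances into a similarity matrix that can replace the additive relationship matrix in the BLUP equations. The sketch below is a generic construction; the mean-distance scaling and the bandwidth θ are illustrative assumptions, not necessarily the parameterisation used in the study:

```python
import numpy as np

def gaussian_kernel(X: np.ndarray, theta: float = 1.0) -> np.ndarray:
    """Gaussian kernel K_ij = exp(-d_ij / theta), where d_ij is the squared
    Euclidean distance between marker profiles, divided by its off-diagonal
    mean so that the bandwidth theta is unit-free."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)                        # clip tiny negative values
    d2 /= d2[np.triu_indices_from(d2, k=1)].mean()  # scale-free distances
    return np.exp(-d2 / theta)
```

Substituting K for G in the mixed-model equations yields the non-parametric KBLUP; combining K with per-individual sparse support sets gives the kernel-based SSI discussed above.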



2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers a wide range of methods that aim to identify observations considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global-level novelty. The problem is approached through a set of very general assumptions, making it innovative in relation to the current literature. We introduce test statistics capable of detecting novelty. They operate on local neighborhoods, and their null distribution is obtained by the permutation principle. We show that they are valid and able to find different types of novelty, e.g. location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.
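A permutation test on local neighborhoods, in the spirit described above, can be sketched with a k-nearest-neighbor distance statistic; the statistic, neighborhood size, and permutation count below are illustrative choices, not the authors' exact tests:

```python
import numpy as np

def knn_stat(train, test, k=3):
    """Mean distance from each test point to its k nearest training points;
    large values indicate test observations in regions the training data
    do not cover."""
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean()

def novelty_pvalue(train, test, k=3, n_perm=200, seed=0):
    """Global novelty test: the null distribution of the statistic is built
    by repeatedly permuting the pooled observations (permutation principle)."""
    rng = np.random.default_rng(seed)
    obs = knn_stat(train, test, k)
    pooled, n = np.vstack([train, test]), len(train)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(knn_stat(pooled[idx[:n]], pooled[idx[n:]], k))
    return (1 + sum(s >= obs for s in null)) / (1 + n_perm)

# A location-shifted test sample should register as novel.
rng = np.random.default_rng(0)
train = rng.standard_normal((60, 2))
p_novel = novelty_pvalue(train, rng.standard_normal((20, 2)) + 5.0)
p_same = novelty_pvalue(train, rng.standard_normal((20, 2)))
```

Because the statistic depends only on inter-point distances, the same construction detects both location and scale alternatives, matching the generality the abstract emphasises.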



2016 ◽  
Author(s):  
S. Hong Lee ◽  
W.M. Shalanee P. Weerasinghe ◽  
Naomi R. Wray ◽  
Michael E. Goddard ◽  
Julius H.J. Van der Werf

Abstract Genomic prediction shows promise for personalised medicine, in which diagnosis and treatment are tailored to individuals based on their genetic profiles. The need for genomic prediction is arguably greatest for complex diseases and disorders, for which both genetic and non-genetic factors contribute to risk. However, we have no adequate insight into the accuracy of such predictions, or into how accuracy may vary between individuals or between populations. In this study, we present a theoretical framework to demonstrate that prediction accuracy can be maximised by targeting more informative individuals in a discovery set with closer relationships to the subjects, making prediction more similar to that in populations with a small effective size (Ne). The increase in prediction accuracy from closer relationships is achieved under an additive model and does not rely on any interaction effects (gene × gene, gene × environment or gene × family). Using theory, simulations and real data analyses, we show that the predictive accuracy, or the area under the receiver operating characteristic curve (AUC), increased exponentially with decreasing Ne. For example, with a set of realistic parameters (discovery sample size N = 3,000 and heritability h2 = 0.5), the AUC approached 0.9 (Ne = 100), up from 0.6 (Ne = 10,000), and the top percentile of estimated genetic profile scores had a 23 times higher proportion of cases than the general population (with Ne = 100), up from a 2 times higher proportion (with Ne = 10,000). This suggests that different interventions in the top-percentile risk groups may be justified (i.e. stratified medicine). In conclusion, we argue that there is considerable room to increase prediction accuracy for polygenic traits by using an efficient design with a smaller Ne (e.g. a design consisting of closer relationships), so that genomic prediction can be more beneficial in clinical applications in the near future.



BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yan Zhou ◽  
Bin Yang ◽  
Junhui Wang ◽  
Jiadi Zhu ◽  
Guoliang Tian

Abstract Background Identifying differentially expressed genes between the same or different species is an urgent demand in biological and medical research. For RNA-seq data, systematic technical effects and different sequencing depths are usually encountered when conducting experiments. Normalization is regarded as an essential step in the discovery of biologically important changes in expression. Present methods usually normalize the data with a scaling factor and then detect significant genes. However, more than one scaling factor may exist because of the complexity of real data. Consequently, methods that normalize data by a single scaling factor may deliver suboptimal performance or may not work at all. The development of modern machine learning techniques has provided a new perspective on discriminating between differentially expressed (DE) and non-DE genes. In reality, however, the non-DE genes comprise only a small set and may contain housekeeping genes (in the same species) or conserved orthologous genes (in different species). Therefore, detecting DE genes can be formulated as a one-class classification problem, where only non-DE genes are observed, while DE genes are completely absent from the training data. Results In this study, we transform the problem into an outlier detection problem by treating DE genes as outliers, and we propose a scaling-free minimum enclosing ball (SFMEB) method that constructs the smallest possible ball containing the known non-DE genes in a feature space. Genes outside the minimum enclosing ball can then naturally be considered DE genes. Compared with existing methods, the proposed SFMEB method does not require data normalization, which is particularly attractive when the RNA-seq data include more than one scaling factor. Furthermore, the SFMEB method can easily be extended to different species without normalization.
Conclusions Simulation studies demonstrate that the SFMEB method works well in a wide range of settings, especially when the data are heterogeneous or consist of biological replicates. Analysis of real data also supports the conclusion that the SFMEB method outperforms existing competitors. The R package of the proposed method is available at https://bioconductor.org/packages/MEB.
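The core geometric step, fitting a minimum enclosing ball to the known non-DE genes and flagging everything outside it, can be sketched with the simple iterative approximation of Bădoiu and Clarkson. This is a generic illustration, not the scaling-free SFMEB implementation released in the MEB package; the `slack` tolerance is an assumed parameter:

```python
import numpy as np

def min_enclosing_ball(points, n_iter=500):
    """Approximate minimum enclosing ball (Badoiu-Clarkson scheme): step
    the centre toward the current farthest point with a 1/(t+1) step size.
    Returns (centre, radius)."""
    c = points.mean(axis=0)
    for t in range(1, n_iter + 1):
        far = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (far - c) / (t + 1)
    return c, np.linalg.norm(points - c, axis=1).max()

def outside_ball(query, reference, slack=1.05):
    """Flag query points lying outside the (slightly inflated) ball fitted
    on the reference set -- a stand-in for calling genes outside the
    non-DE ball differentially expressed."""
    c, r = min_enclosing_ball(reference)
    return np.linalg.norm(query - c, axis=1) > slack * r
```

In the one-class setting described above, only the reference (non-DE) genes are needed to fit the ball; no labels for DE genes are required at training time.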



2021 ◽  
Vol 11 (11) ◽  
pp. 5043
Author(s):  
Xi Chen ◽  
Bo Kang ◽  
Jefrey Lijffijt ◽  
Tijl De Bie

Many real-world problems can be formalized as predicting links in a partially observed network. Examples include Facebook friendship suggestions, the prediction of protein–protein interactions, and the identification of hidden relationships in a crime network. Several link prediction algorithms, notably those recently introduced using network embedding, can do this by relying only on the observed part of the network. Often, whether two nodes are linked can be queried, albeit at a substantial cost (e.g., by questionnaires, wet lab experiments, or undercover work). Such additional information can improve link prediction accuracy, but owing to the cost, the queries must be made with due consideration. Thus, we argue that an active learning approach is of great potential interest, and we developed ALPINE (Active Link Prediction usIng Network Embedding), a framework that identifies the most useful link statuses to query by estimating the improvement in link prediction accuracy to be gained by querying them. We propose several query strategies for use in combination with ALPINE, inspired by the optimal experimental design and active learning literature. Experimental results on real data not only show that ALPINE is scalable and boosts link prediction accuracy with far fewer queries, but also shed light on the relative merits of the strategies, providing actionable guidance for practitioners.
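ALPINE's utility-based query strategies estimate the accuracy gain from each candidate query, but the core active-learning loop can be illustrated with plain uncertainty sampling over unobserved node pairs. The function below is a hypothetical simplification written for this summary, not part of ALPINE:

```python
import numpy as np

def most_uncertain_pairs(link_prob, observed, n_queries=3):
    """Return the unobserved node pairs whose predicted link probability
    is closest to 0.5 -- plain uncertainty sampling for an undirected
    network, a simple stand-in for utility-based query selection."""
    uncertainty = np.abs(link_prob - 0.5)
    uncertainty[observed] = np.inf             # never re-query known statuses
    iu = np.triu_indices_from(link_prob, k=1)  # undirected: upper triangle
    order = np.argsort(uncertainty[iu])
    return [(int(iu[0][i]), int(iu[1][i])) for i in order[:n_queries]]
```

In a full active-learning loop, each queried pair's true status would be fed back into the network embedding before selecting the next batch of queries.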



2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Daniel E. Runcie ◽  
Jiayi Qu ◽  
Hao Cheng ◽  
Lorin Crawford

Abstract Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as in human genetics. However, the statistical foundation of multi-trait genomic prediction is the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present , a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.


