A better design for stratified medicine based on genomic prediction

2016 ◽  
Author(s):  
S. Hong Lee ◽  
W.M. Shalanee P. Weerasinghe ◽  
Naomi R. Wray ◽  
Michael E. Goddard ◽  
Julius H.J. Van der Werf

ABSTRACT Genomic prediction shows promise for personalised medicine, in which diagnosis and treatment are tailored to individuals based on their genetic profiles. Genomic prediction is arguably most needed for complex diseases and disorders, for which both genetic and non-genetic factors contribute to risk. However, we have no adequate insight into the accuracy of such predictions, or into how accuracy may vary between individuals or between populations. In this study, we present a theoretical framework to demonstrate that prediction accuracy can be maximised by targeting more informative individuals in a discovery set, namely those with closer relationships to the subjects, making prediction more similar to that in populations with small effective size (Ne). The increase in prediction accuracy from closer relationships is achieved under an additive model and does not rely on any interaction effects (gene × gene, gene × environment or gene × family). Using theory, simulations and real data analyses, we show that the predictive accuracy, or the area under the receiver operating characteristic curve (AUC), increased exponentially with decreasing Ne. For example, with a set of realistic parameters (discovery set sample size N=3000 and heritability h²=0.5), the AUC approached 0.9 (Ne=100), up from 0.6 (Ne=10000), and the top percentile of the estimated genetic profile scores had a 23 times higher proportion of cases than the general population (with Ne=100), up from a 2 times higher proportion of cases (with Ne=10000). This suggests that different interventions in the top percentile risk groups may be justified (i.e. stratified medicine). In conclusion, it is argued that there is considerable room to increase prediction accuracy for polygenic traits by using an efficient design with a smaller Ne (e.g. a design consisting of closer relationships), so that genomic prediction can be more beneficial in clinical applications in the near future.
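
As a rough illustration of why a smaller Ne helps, the sketch below uses two textbook approximations that are assumptions here, not the authors' own derivation: the effective number of chromosome segments Me = 2·Ne·L / ln(4·Ne·L), and the expected squared accuracy r² = N·h² / (N·h² + Me). The genome length of 30 Morgans is also an assumed value. The sketch shows only the direction of the effect and does not reproduce the AUC values quoted in the abstract.

```python
import math

def effective_segments(ne, genome_morgans=30.0):
    """Approximate number of independent chromosome segments (assumed formula)."""
    nl = ne * genome_morgans
    return 2.0 * nl / math.log(4.0 * nl)

def expected_r2(n, h2, me):
    """Expected squared correlation between predicted and true genetic value."""
    return n * h2 / (n * h2 + me)

N, H2 = 3000, 0.5  # discovery sample size and heritability quoted in the abstract

# Expected r2 for three effective population sizes (smaller Ne -> fewer segments).
r2_by_ne = {
    ne: expected_r2(N, H2, effective_segments(ne)) for ne in (100, 1000, 10000)
}

for ne, r2 in r2_by_ne.items():
    print(f"Ne={ne:>5}  Me={effective_segments(ne):9.0f}  r2={r2:.3f}")
```

Under these assumptions, fewer independent segments have to be estimated from the same N when Ne is small, so the expected accuracy rises as Ne falls.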

2017 ◽  
Author(s):  
S. Hong Lee ◽  
Sam Clark ◽  
Julius H.J. van der Werf

ABSTRACT Genomic prediction is emerging in a wide range of fields, including animal and plant breeding, risk prediction in human precision medicine, and forensics. It is desirable to establish a theoretical framework for genomic prediction accuracy when the reference data consists of information sources with varying degrees of relationship to the target individuals. A reference set can contain both close and distant relatives as well as ‘unrelated’ individuals from the wider population. The various sources of information were modeled as different populations with different effective population sizes (Ne). Both the effective number of chromosome segments (Me) and Ne are considered to be a function of the data used for prediction. We validate our theory with analyses of simulated as well as real data, and illustrate that the variation in genomic relationships with the target is a predictor of the information content of the reference set. With a similar amount of data available for each source, we show that close relatives can have a substantially larger effect on genomic prediction accuracy than less closely related individuals. We also illustrate that when prediction relies on closer relatives, there is less improvement in prediction accuracy with an increase in training data or marker panel density. We release software that can estimate the expected prediction accuracy and power when combining different reference sources with various degrees of relationship to the target, which is useful when planning genomic prediction (before or after collecting data) in animal, plant and human genetics.


2020 ◽  
Vol 98 (Supplement_3) ◽  
pp. 27-27
Author(s):  
Junjie Han ◽  
Cedric Gondro ◽  
Juan Steibel

Abstract Deep learning (DL) is being used for prediction in precision livestock farming and in genomic prediction. However, optimizing hyperparameters in DL models is critical for their predictive performance. Grid search is the traditional approach to select hyperparameters in DL, but it requires exhaustive search over the parameter space. We propose hyperparameter selection using differential evolution (DE), which is a heuristic algorithm that does not require exhaustive search. The goal of this study was to design and apply DE to optimize hyperparameters of DL models for genomic prediction and image analysis in pig production systems. One dataset consisted of 910 pigs genotyped with 28,916 SNP markers to predict their post-mortem meat pH. Another dataset consisted of 1,334 images of pigs eating inside a single-spaced feeder classified as: “single pig” or “multiple pigs.” The accuracy of genomic prediction was defined as the correlation between the predicted pH and the observed pH. The image classification prediction accuracy was the proportion of correctly classified images. For genomic prediction, a multilayer perceptron (MLP) was optimized. For image classification, MLP and convolutional neural networks (CNN) were optimized. For genomic prediction, the initial hyperparameter set resulted in an accuracy of 0.032 and for image classification, the initial accuracy was between 0.72 and 0.76. After optimization using DE, the genomic prediction accuracy was 0.3688 compared to 0.334 using GBLUP. The top selected models included one layer, 60 neurons, sigmoid activation and L2 penalty = 0.3. The accuracy of image classification after optimization was between 0.89 and 0.92. Selected models included three layers, adamax optimizer and relu or elu activation for the MLP, and one layer, 64 filters and 5×5 filter size for the CNN. 
DE can adapt the hyperparameter selection to each problem, dataset and model, and it significantly increased prediction accuracy with minimal user input.
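
To make the idea of differential evolution concrete, here is a minimal sketch of the classic DE/rand/1/bin scheme in plain Python. It is not the authors' implementation: the function name, the settings (`pop_size`, `f`, `cr`) and the toy objective are all hypothetical, and the toy objective merely stands in for a validation loss that would normally come from training a DL model with the candidate hyperparameters.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, generations=100,
                           f=0.8, cr=0.9, seed=0):
    """Minimise `objective` over box `bounds` with DE/rand/1/bin (illustrative)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def toy_loss(x):
    """Hypothetical smooth loss surface with its minimum at (1.5, -0.5)."""
    return (x[0] - 1.5) ** 2 + (x[1] + 0.5) ** 2

best, best_score = differential_evolution(toy_loss, [(-5.0, 5.0), (-5.0, 5.0)])
print(best, best_score)
```

Unlike grid search, each generation evaluates only `pop_size` candidates, and the search concentrates where earlier candidates scored well.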


2021 ◽  
Author(s):  
Ao Zhang ◽  
Shan Chen ◽  
Zhenhai Cui ◽  
Yubo Liu ◽  
Yuan Guan ◽  
...  

Abstract Drought tolerance in maize is a complex and polygenic trait, especially in the seedling stage. In plant breeding, such traits can be improved by genomic selection (GS), which has become a practical and effective tool. In the present study, a natural maize population named Northeast China core population (NCCP), consisting of 379 inbred lines, was genotyped with diversity arrays technology (DArT) and genotyping-by-sequencing (GBS) platforms. Target traits of seedling emergence rate (ER), seedling plant height (SPH), and grain yield (GY) were evaluated under two natural drought environments in northeast China. Although adequate genetic variants were found for genomic selection, they were not stable enough between the two years. Similarly, the heritability of the three traits was not stable enough: the heritabilities in 2019 (0.88, 0.82, 0.85 for ER, SPH, GY) were higher than those in 2020 (0.65, 0.53, 0.33) and across the two years (0.32, 0.26, 0.33). The current research obtained two kinds of marker sets: the SilicoDArT markers were from DArT-seq, and the SNPs were from GBS and DArT-seq. In total, 11,865 SilicoDArT markers, 7,837 DArT SNPs, and 91,003 GBS SNPs were used for analysis after quality control. The results of phylogenetic trees showed that the population was rich in consanguinity. Genomic prediction results showed that the average prediction accuracies estimated using the DArT SNP dataset under the 2-fold cross-validation scheme were 0.27, 0.19, and 0.33 for ER, SPH, and GY, respectively. The results for SilicoDArT were close to those for the SNPs from DArT-seq: 0.26, 0.22, and 0.33. For SPH, the prediction accuracies using SilicoDArT were higher than those using DArT SNPs; in some cases, alignment to the reference genome results in a loss of prediction accuracy. For traits with lower heritability, filtering on linkage disequilibrium can improve the prediction accuracy. For the same trait, the prediction accuracies estimated with the two types of DArT markers were consistently higher than those estimated with the GBS SNPs under the same genotyping cost. Our results show that prediction accuracy improved in some cases when population structure and marker quality were controlled, even when marker density was reduced. In the initial maize breeding cycle, SilicoDArT markers can achieve higher prediction accuracy at a lower cost. However, higher marker density platforms (i.e. GBS) may play a role in subsequent breeding cycles in the long term. The natural drought experimental station can reduce the difficulty of phenotypic identification in a water-scarce environment. The accumulation of more yearly data will help to stabilize the heritability and improve predictive accuracy in maize breeding. The experimental design and model for drought resistance also need to be further developed.


2021 ◽  
Author(s):  
Elaheh Vojgani ◽  
Torsten Pook ◽  
Armin C. Hölker ◽  
Manfred Mayer ◽  
Chris-Carolin Schön ◽  
...  

Abstract The importance of accurate genomic prediction of phenotypes in plant breeding is undeniable, as higher prediction accuracy can increase selection responses. In this study, we investigated the ability of three models to improve prediction accuracy by including phenotypic information from the last growing season. This was done by considering a single biological trait in two growing seasons (2017 and 2018) as separate traits in a multi-trait model. Thus, bivariate variants of the Genomic Best Linear Unbiased Prediction (GBLUP) as an additive model, Epistatic Random Regression BLUP (ERRBLUP) and selective Epistatic Random Regression BLUP (sERRBLUP) as epistasis models were compared with respect to their prediction accuracies for the second year. The results indicate that bivariate ERRBLUP is almost identical to bivariate GBLUP in prediction accuracy, while bivariate sERRBLUP has the highest prediction accuracy in most cases. The obtained prediction accuracies were similar when utilizing pruned sets of SNPs and haplotype blocks, while utilizing haplotype blocks reduces the computational load significantly compared to utilizing pruned sets of SNPs. The prediction accuracies of bivariate GBLUP, ERRBLUP and sERRBLUP have been assessed across eight phenotypic traits and studied datasets from 471/402 doubled haploid lines in the European maize landrace Kemater Landmais Gelb/Petkuser Ferdinand Rot. We further investigated the genomic correlation, phenotypic correlation and trait heritability as factors affecting the bivariate models’ prediction accuracy, with genetic correlation between growing seasons being the most important one. For all three considered model architectures results were far worse when using a univariate version of the model.


Author(s):  
Elaheh Vojgani ◽  
Torsten Pook ◽  
Johannes W.R. Martini ◽  
Armin C. Hölker ◽  
Manfred Mayer ◽  
...  

Abstract We compared the predictive ability of various prediction models for a maize dataset derived from 910 doubled haploid lines from European landraces (Kemater Landmais Gelb and Petkuser Ferdinand Rot), which were tested in six locations in Germany and Spain. The compared models were Genomic Best Linear Unbiased Prediction (GBLUP) as an additive model, Epistatic Random Regression BLUP (ERRBLUP) accounting for all pairwise SNP interactions, and selective Epistatic Random Regression BLUP (sERRBLUP) accounting for a selected subset of pairwise SNP interactions. These models have been compared in both univariate and bivariate statistical settings within and across environments. Our results indicate that modeling all pairwise SNP interactions into the univariate/bivariate model (ERRBLUP) is not superior in predictive ability to the respective additive model (GBLUP). However, incorporating only a selected subset of interactions with the highest effect variances in univariate/bivariate sERRBLUP can increase predictive ability significantly compared to the univariate/bivariate GBLUP. Overall, bivariate models consistently outperform univariate models in predictive ability. Over all studied traits, locations, and landraces, the increase in prediction accuracy from univariate GBLUP to univariate sERRBLUP ranged from 5.9 to 112.4 percent, with an average increase of 47 percent. For bivariate models, the change ranged from −0.3 to +27.9 percent comparing the bivariate sERRBLUP to the bivariate GBLUP. The average increase across traits and locations was 11 percent. This considerable increase in predictive ability achieved by sERRBLUP may be of interest for “sparse testing” approaches in which only a subset of the lines/hybrids of interest is observed at each location. Key Message: The prediction accuracy of genomic prediction of phenotypes can be increased by only including top ranked pairwise SNP interactions into the prediction models.


2020 ◽  
Author(s):  
Elaheh Vojgani ◽  
Torsten Pook ◽  
Armin C. Hölker ◽  
Manfred Mayer ◽  
Chris-Carolin Schön ◽  
...  

Abstract The importance of accurate genomic prediction of phenotypes in plant breeding is undeniable, as higher prediction accuracy can increase selection responses. In this study, we investigated the ability of three models to improve prediction accuracy by including phenotypic information from the last growing season. This was done by considering a single biological trait in two growing seasons (2017 and 2018) as separate traits in a multi-trait model. Thus, bivariate variants of the Genomic Best Linear Unbiased Prediction (GBLUP) as an additive model, Epistatic Random Regression BLUP (ERRBLUP) and selective Epistatic Random Regression BLUP (sERRBLUP) as epistasis models were compared with respect to their prediction accuracies for the second year. The results indicate that bivariate ERRBLUP is slightly superior to bivariate GBLUP in prediction accuracy, while bivariate sERRBLUP has the highest prediction accuracy in most cases. The average relative increases in prediction accuracy from bivariate GBLUP to maximum bivariate sERRBLUP across eight phenotypic traits, for the studied datasets from 471/402 doubled haploid lines in the European maize landraces Kemater Landmais Gelb/Petkuser Ferdinand Rot, were 7.61 and 3.47 percent, respectively. We further investigated the genomic correlation, phenotypic correlation and trait heritability as the factors affecting the bivariate model’s prediction accuracy, with genetic correlation between growing seasons being the most important one. For all three considered model architectures, results were far worse when using a univariate version of the model, e.g. with an average reduction in prediction accuracy of 0.23/0.14 for Kemater/Petkuser when using univariate GBLUP. Key Message: Bivariate models based on selected subsets of pairwise SNP interactions can increase the prediction accuracy by utilizing phenotypic data across years under the assumption of high genomic correlation across years.


Author(s):  
Saheb Foroutaifar

Abstract The main objectives of this study were to compare the prediction accuracy of different Bayesian methods for traits with a wide range of genetic architectures, using simulated and real data, and to assess the sensitivity of these methods to violations of their assumptions. For the simulation study, different scenarios were implemented based on two traits with low or high heritability, different numbers of QTL, and different distributions of QTL effects. For real data analysis, a German Holstein dataset for milk fat percentage, milk yield, and somatic cell score was used. The simulation results showed that, with the exception of Bayes R, the methods were sensitive to changes in the number of QTL and the distribution of QTL effects. Having a distribution of QTL effects similar to what the different Bayesian methods assume for estimating marker effects did not improve their prediction accuracy. The Bayes B method gave accuracy higher than or equal to that of the other methods. The real data analysis showed that, similar to the simulation scenarios with a large number of QTL, there was no difference between the accuracies of the different methods for any of the traits.


2021 ◽  
Vol 11 (11) ◽  
pp. 5043
Author(s):  
Xi Chen ◽  
Bo Kang ◽  
Jefrey Lijffijt ◽  
Tijl De Bie

Many real-world problems can be formalized as predicting links in a partially observed network. Examples include Facebook friendship suggestions, the prediction of protein–protein interactions, and the identification of hidden relationships in a crime network. Several link prediction algorithms, notably those recently introduced using network embedding, are capable of doing this by just relying on the observed part of the network. Often, whether two nodes are linked can be queried, albeit at a substantial cost (e.g., by questionnaires, wet lab experiments, or undercover work). Such additional information can improve the link prediction accuracy, but owing to the cost, the queries must be made with due consideration. Thus, we argue that an active learning approach is of great potential interest and developed ALPINE (Active Link Prediction usIng Network Embedding), a framework that identifies the most useful link status by estimating the improvement in link prediction accuracy to be gained by querying it. We proposed several query strategies for use in combination with ALPINE, inspired by the optimal experimental design and active learning literature. Experimental results on real data not only showed that ALPINE was scalable and boosted link prediction accuracy with far fewer queries, but also shed light on the relative merits of the strategies, providing actionable guidance for practitioners.


Genetics ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Gustavo de los Campos

Abstract Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and in linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimum for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset from the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a Sparse Selection Index (SSI) that integrates Selection Index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-BLUP (the prediction method most commonly used in plant and animal breeding) appears as a special case which happens when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in ten different environments) that the SSI can achieve significant (anywhere between 5-10%) gains in prediction accuracy relative to the G-BLUP.
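
The relationship between the sparse index and G-BLUP can be sketched with a small L1-penalised quadratic problem. The objective used below, 0.5·b'(G + θI)b − g'b + λ·Σ|b|, with θ standing for the residual-to-genetic variance ratio, is an assumed form chosen for illustration and may differ in detail from the paper; the function names, the toy relationship matrix and the coordinate-descent solver are likewise hypothetical. With λ = 0 the weights solve (G + θI)b = g, the usual selection-index (G-BLUP) system; a positive λ shrinks weakly related training individuals to exactly zero.

```python
def soft_threshold(x, lam):
    """Shrink x towards zero by lam, returning exactly 0.0 when |x| <= lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def sparse_index_weights(G, g, theta, lam, sweeps=500):
    """Coordinate descent for 0.5*b'(G + theta*I)b - g'b + lam*sum(|b|)."""
    n = len(g)
    b = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            # Covariance with the target minus what the other weights explain.
            r = g[j] - sum(G[j][k] * b[k] for k in range(n) if k != j)
            b[j] = soft_threshold(r, lam) / (G[j][j] + theta)
    return b

# Toy genomic relationships among three training individuals (made-up values).
G = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
g = [0.6, 0.3, 0.05]   # relationships between the target and each training individual
theta = 1.0            # variance ratio corresponding to h2 = 0.5

dense = sparse_index_weights(G, g, theta, lam=0.0)    # G-BLUP-like weights
sparse = sparse_index_weights(G, g, theta, lam=0.1)   # some weights become zero
print(dense)
print(sparse)
```

Each prediction individual gets its own vector `g`, so the set of non-zero weights (the support points) can differ from one individual to the next.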


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Akio Onogi ◽  
Toshio Watanabe ◽  
Atsushi Ogino ◽  
Kazuhito Kurogi ◽  
Kenji Togashi

Abstract Background Genomic prediction is now an essential technology for genetic improvement in animal and plant breeding. Whereas emphasis has been placed on predicting breeding values, the prediction of non-additive genetic effects has also been of interest. In this study, we assessed the potential of genomic prediction using non-additive effects for phenotypic prediction in Japanese Black, a beef cattle breed. In addition, we examined the stability of variance component and genetic effect estimates against population size by subsampling with different sample sizes. Results Records of six carcass traits, namely, carcass weight, rib eye area, rib thickness, subcutaneous fat thickness, yield rate and beef marbling score, for 9850 animals were used for analyses. The non-additive genetic effects considered were dominance, additive-by-additive, additive-by-dominance and dominance-by-dominance effects. The covariance structures of these genetic effects were defined using genome-wide SNPs. Using single-trait animal models with different combinations of genetic effects, it was found that 12.6–19.5% of the phenotypic variance was explained by the additive-by-additive variance, whereas little dominance variance was observed. In cross-validation, adding the additive-by-additive effects had little influence on predictive accuracy and bias. Subsampling analyses showed that estimation of the additive-by-additive effects was highly variable when phenotypes were not available. On the other hand, the estimates of the additive-by-additive variance components were less affected by reduction of the population size. Conclusions The six carcass traits of Japanese Black cattle showed moderate or relatively high levels of additive-by-additive variance components, although incorporating the additive-by-additive effects did not improve the predictive accuracy. Subsampling analysis suggested that estimation of the additive-by-additive effects was highly reliant on the phenotypic values of the animals to be estimated, as supported by low off-diagonal values of the relationship matrix. On the other hand, estimates of the additive-by-additive variance components were relatively stable against reduction of the population size compared with the estimates of the corresponding genetic effects.

