Accuracy of gene expression prediction from genotype data with PrediXcan varies across diverse populations

2019 ◽  
Author(s):  
Anna Mikhaylova ◽  
Timothy Thornton

Abstract Predicting gene expression with genetic data has garnered significant attention in recent years. PrediXcan is one of the most widely used gene-based association methods for testing the association of imputed gene expression values with a phenotype, owing to the insight the method has provided into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The prediction models for PrediXcan, however, were obtained using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) and the Genotype-Tissue Expression (GTEx) data, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we assess the accuracy of gene expression predictions with PrediXcan in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Health and Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European populations. Prediction results are obtained using a range of models from PrediXcan weight databases, and Pearson’s correlation coefficient is used to measure prediction accuracy. We demonstrate that the predictive performance of PrediXcan varies across populations (F-test p-value < 0.001), with prediction accuracy lowest in the Yoruban sample relative to the European samples. Moreover, the performance of PrediXcan varies not only among distant populations, but also among closely related populations. We also find that the qualitative performance of PrediXcan for the populations considered is consistent across all weight databases used.
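The accuracy measure described in this abstract can be illustrated with a minimal sketch: expression is imputed as a weighted sum of allele dosages and then compared with observed expression using Pearson's correlation coefficient. The SNP identifiers, weights, dosages, and observed values below are invented toy data, and the helper names `predict_expression` and `pearson_r` are hypothetical; this is not the PrediXcan software or its weight databases.

```python
from math import sqrt

def predict_expression(dosages, weights):
    """Impute expression for one individual as sum(weight * allele dosage)."""
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy per-SNP weights for a single gene (assumed values, for illustration only).
weights = {"rs1": 0.5, "rs2": -0.3}

# Toy genotypes: allele dosages (0, 1, or 2) for four individuals.
genotypes = [
    {"rs1": 0, "rs2": 2},
    {"rs1": 1, "rs2": 1},
    {"rs1": 2, "rs2": 0},
    {"rs1": 1, "rs2": 2},
]

# Toy observed expression for the same four individuals.
observed = [-0.7, 0.1, 1.2, 0.0]

predicted = [predict_expression(g, weights) for g in genotypes]
r = pearson_r(predicted, observed)
print(f"Pearson r between predicted and observed expression: {r:.3f}")
```

Repeating this calculation per gene and per population would give the kind of accuracy distributions that the abstract compares across populations.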

2018 ◽  
Author(s):  
Lauren S Mogil ◽  
Angela Andaleon ◽  
Alexa Badalamenti ◽  
Scott P Dickinson ◽  
Xiuqing Guo ◽  
...  

For many complex traits, gene regulation is likely to play a crucial mechanistic role. How the genetic architectures of complex traits vary between populations and subsequent effects on genetic prediction are not well understood, in part due to the historical paucity of GWAS in populations of non-European ancestry. We used data from the MESA (Multi-Ethnic Study of Atherosclerosis) cohort to characterize the genetic architecture of gene expression within and between diverse populations. Genotype and monocyte gene expression were available in individuals with African American (AFA, n=233), Hispanic (HIS, n=352), and European (CAU, n=578) ancestry. We performed expression quantitative trait loci (eQTL) mapping in each population and show genetic correlation of gene expression depends on shared ancestry proportions. Using elastic net modeling with cross validation to optimize genotypic predictors of gene expression in each population, we show the genetic architecture of gene expression for most predictable genes is sparse. We found the best predicted gene, TACSTD2, was the same across populations with R² > 0.86 in each population. However, we identified a subset of genes that are well-predicted in one population, but poorly predicted in another. We show these differences in predictive performance are due to allele frequency differences between populations. Using genotype weights trained in MESA to predict gene expression in independent populations showed that a training set with ancestry similar to the test set is better at predicting gene expression in test populations, demonstrating an urgent need for diverse population sampling in genomics. Our predictive models and performance statistics in diverse cohorts are made publicly available for use in transcriptome mapping methods at https://github.com/WheelerLab/DivPop.


2018 ◽  
Author(s):  
Carla Márquez-Luna ◽  
Steven Gazal ◽  
Po-Ru Loh ◽  
Samuel S. Kim ◽  
Nicholas Furlotte ◽  
...  

Abstract Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a new method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, which includes coding, conserved, regulatory and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. LDpred-funct attained higher prediction accuracy than other polygenic prediction methods in simulations using real genotypes. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank. We used association statistics from British-ancestry samples as training data (avg N=373K) and samples of other European ancestries as validation data (avg N=22K), to minimize confounding. LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R²=0.144; highest R²=0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (total N=1107K; higher heritability in UK Biobank cohort) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.


Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 502-502
Author(s):  
Bart Burington ◽  
John Shaughnessy ◽  
Bart Barlogie ◽  
John Crowley

Abstract The prognosis of patients with MM varies widely. High risk is best captured by cellular and molecular genetic features. Objective: to determine whether the predictive power of baseline GEP and metaphase cytogenetic abnormalities (CA) could be improved by the availability of GEP data obtained 48 hr after single agent D or T, pre-therapy. A total of 668 patients were enrolled on TT2, 323 randomized to T and 345 without T (ASCO 05). When randomized to T/no T, baseline and 48 hr GEP samples were obtained from 32/41 receiving a test dose of T/D vs 10/14 receiving the full VAD+T/VAD regimen. A total of 97 baseline/early treatment GEP pairs were analyzed. Combined baseline expression and 48 hr expression changes of 151 genes predicted EFS at a false discovery rate (FDR) of 10%. The table compares baseline EFS high-risk dysregulation to the direction of 48 hour changes conferring improved EFS. Decreases over 48 hours are associated with improved EFS in 74 of 78 genes (upregulated expression confers poor survival at baseline). In the remaining 4, perturbation in the direction of an additional increase may be a marker of early response. With EFS-associated genes, we trained 15 EFS prediction models using baseline expression and 15 prediction models using the change in expression between baseline and 48 hours. Training sets were random splits of 97 patients, and baseline and change models separately predicted an EFS risk index in the remaining validation patients (standardized to a variance of 1). Risk indices were compared to an indicator of cytogenetic abnormalities (CA) among validation patients using multivariate proportional hazards analyses. The table shows median hazard ratios and p-values for competing predictors in 15 validation sets. Without cytogenetics, combined GEP baseline and change indices were significant predictors in all 15 validation sets (median combined P-value of 0.002). 
The table shows the median-performing GEP model of 15 in a multivariate analysis including cytogenetics for all 97 patients. 48 hour changes in gene expression in newly diagnosed myeloma patients can significantly predict EFS in validated prediction models, alone and in combination with baseline GEP. After adjustment for baseline and 48-hour GEP change indices, metaphase cytogenetics is no longer a significant predictor in independent patient samples.

Baseline EFS risk and 48 hour changes associated with good outcome in 151 EFS-associated genes

                          Improved EFS
Baseline high risk        Decrease (HR>=1)    Increase (HR<1)
Downregulated (HR<1)      6                   67
Upregulated (HR>=1)       74                  4

Median Hazard Ratios and P-values for Multivariate Models in 15 Validation Sets

                          HR     P        # of p-values below .05 (of 15)
GEP baseline risk         2.1    0.037    10
GEP 48 hr change risk     1.9    0.052    7
CA                        1.6    0.310    0
Median validation set overall P-value: 0.0003

Median GEP EFS baseline/48 hour EFS prediction model (n=97)

                          HR     P
GEP baseline risk         2.0    0.004
GEP 48 hr change risk     2.6    0.001
CA                        1.4    0.330


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 3114-3114
Author(s):  
Umesh Kathad ◽  
Yuvanesh Vedaraju ◽  
Aditya Kulkarni ◽  
Gregory Tobin ◽  
Panna Sharma

3114 Background: The Response Algorithm for Drug positioning and Rescue (RADR) technology is Lantern Pharma's proprietary Artificial Intelligence (AI)-based machine learning approach for biomarker identification and patient stratification. RADR is a combination of three automated modules working sequentially to generate drug- and tumor type-specific gene signatures predictive of response. Methods: RADR integrates genomics, drug sensitivity and systems biology inputs with supervised machine learning strategies and generates gene expression-based responder/non-responder profiles for specific tumor indications with high accuracy, in addition to identifying new correlations of genetic biomarkers with drug activity. Pre-treatment patient gene expression profiles along with corresponding treatment outcomes were used as algorithm inputs. Model training was typically performed using an initial set of genes derived from cancer cell line data when available, and further applied to patient data for model tuning, cross-validation and final gene signature development. Model testing and performance computation were carried out on patient records held out as blinded datasets. Response prediction accuracy and sensitivity were among the model performance metrics calculated. Results: On average, RADR achieved a response prediction accuracy of 80% during clinical validation. We present retrospective analyses performed as part of RADR validation using more than 10 independent datasets of patients from selected cancer types treated with approved drugs including chemotherapy, targeted therapy and immunotherapy agents. For instance, the application of the RADR program to a Paclitaxel trial in breast cancer patients could have potentially reduced the number of patients in the treatment arm from 92 unselected patients to 24 biomarker-selected patients to produce the same number of responders. Also, we cite published evidence correlating genes from RADR-derived biomarkers with increased Paclitaxel sensitivity in breast cancer. Conclusions: The value of the RADR platform architecture is derived from its validation through the analysis of approximately 17 million oncology-specific clinical data points and approximately 1000 patient records. By implementing unique biological, statistical and machine learning workflows, Lantern Pharma's RADR technology is capable of deriving robust biomarker panels for pre-selecting true responders for recruitment into clinical trials, which may improve the success rate of oncology drug approvals.


2020 ◽  
Author(s):  
Peipei Wang ◽  
Bethany M. Moore ◽  
Sahra Uygun ◽  
Melissa D. Lehti-Shiu ◽  
Cornelius S. Barry ◽  
...  

Abstract Plant metabolites produced via diverse pathways are important for plant survival, human nutrition and medicine. However, the pathway memberships of most plant enzyme genes are unknown. While co-expression is useful for assigning genes to pathways, expression correlation may exist only under specific spatiotemporal and conditional contexts. Utilizing >600 expression values and similarity data combinations from tomato, three strategies for predicting membership in 85 pathways were explored: naive prediction (identifying pathways with the most similarly expressed genes), unsupervised and supervised learning. Optimal predictions for different pathways require distinct data combinations that, in some cases, are indicative of biological processes relevant to pathway functions. Naive prediction produced higher error rates compared with machine learning methods. In 52 pathways, unsupervised learning performed better than a supervised approach, which may be due to the limited availability of training data. Furthermore, using gene-to-pathway expression similarities led to prediction models that outperformed those based simply on gene expression levels. Our study highlights the need to extensively explore expression-based features and prediction strategies to maximize the accuracy of metabolic pathway membership assignment. We anticipate that the prediction framework outlined here can be applied to other species and also be used to improve plant pathway annotation.


2018 ◽  
Author(s):  
Min Wang ◽  
Timothy P Hancock ◽  
Amanda J. Chamberlain ◽  
Christy J. Vander Jagt ◽  
Jennie E Pryce ◽  
...  

Abstract Background: Topological association domains (TADs) are chromosomal domains characterised by frequent internal DNA-DNA interactions. The transcription factor CTCF binds to conserved DNA sequence patterns called CTCF binding motifs to either prohibit or facilitate chromosomal interactions. TADs and CTCF binding motifs control gene expression, but they are not yet well defined in the bovine genome. In this paper, we sought to improve the annotation of bovine TADs and CTCF binding motifs, and assess whether the new annotation can reduce the search space for cis-regulatory variants. Results: We used genomic synteny to map TADs and CTCF binding motifs from humans, mice, dogs and macaques to the bovine genome. We found that our mapped TADs exhibited the same hallmark properties as those sourced from experimental data, such as housekeeping genes, tRNA genes, CTCF binding motifs, SINEs, H3K4me3 and H3K27ac. Then we showed that runs of genes with the same pattern of allele-specific expression (ASE) (either favouring paternal or maternal allele) were often located in the same TAD or between the same conserved CTCF binding motifs. Analyses of variance showed that when averaged across all bovine tissues tested, TADs explained 14% of ASE variation (standard deviation, SD: 0.056), while CTCF explained 27% (SD: 0.078). Furthermore, we showed that the quantitative trait loci (QTLs) associated with gene expression variation (eQTLs) or ASE variation (aseQTLs), which were identified from mRNA transcripts from 141 lactating cows’ white blood and milk cells, were highly enriched at putative bovine CTCF binding motifs. The most significant aseQTL and eQTL for each genic target were located within the same TAD as the gene more often than expected (Chi-Squared test P-value ≤ 0.001). Conclusions: Our results suggest that genomic synteny can be used to functionally annotate conserved transcriptional components, and provides a tool to reduce the search space for causative regulatory variants in the bovine genome.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (10) ◽  
pp. e1009568
Author(s):  
Anju Giri ◽  
Merritt Khaipho-Burch ◽  
Edward S. Buckler ◽  
Guillaume P. Ramstein

Genomic prediction typically relies on associations between single-site polymorphisms and traits of interest. This representation of genomic variability has been successful for predicting many complex traits. However, it usually cannot capture the combination of alleles in haplotypes and it has generated little insight about the biological function of polymorphisms. Here we present a novel and cost-effective method for imputing cis haplotype associated RNA expression (HARE), study the transferability of HARE estimates across tissues, and evaluate genomic prediction models within and across populations. HARE focuses on tightly linked cis acting causal variants in the immediate vicinity of the gene, while excluding trans effects from diffusion and metabolism. Therefore, HARE estimates were more transferable across different tissues and populations compared to measured transcript expression. We also showed that HARE estimates captured one-third of the variation in gene expression. HARE estimates were used in genomic prediction models evaluated within and across two diverse maize panels, a diverse association panel (Goodman Association panel) and a large half-sib panel (Nested Association Mapping panel), for predicting 26 complex traits. HARE resulted in up to 15% higher prediction accuracy than control approaches that preserved haplotype structure, suggesting that HARE carried functional information in addition to information about haplotype structure. The largest increase was observed when the model was trained in the Nested Association Mapping panel and tested in the Goodman Association panel. Additionally, HARE yielded higher within-population prediction accuracy as compared to measured expression values. The accuracy achieved by measured expression was variable across tissues, whereas accuracy by HARE was more stable across tissues. Therefore, imputing RNA expression of genes by haplotype is stable, cost-effective, and transferable across populations.


Heredity ◽  
2021 ◽  
Author(s):  
Marco Lopez-Cruz ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
Jose Crossa ◽  
Paulino Pérez-Rodríguez ◽  
...  

Abstract Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulates, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic-BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using SSI with additive relationships (GSSI) led to 5–17% increases in accuracy, relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.
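To make the Gaussian kernel mentioned in this abstract concrete, the following minimal sketch builds a kernel matrix from marker genotypes. The marker matrix is invented toy data, and the scaling of the squared distance by the number of markers and the default `bandwidth` of 1.0 are assumptions chosen for illustration; the abstract does not state how the authors parameterized their kernel.

```python
from math import exp

def squared_distance(a, b):
    """Squared Euclidean distance between two marker vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gaussian_kernel(markers, bandwidth=1.0):
    """K[i][j] = exp(-bandwidth * d_ij^2 / p), with p the number of markers.

    Dividing by p is an assumed scaling so that the bandwidth does not
    depend on how many markers are genotyped.
    """
    p = len(markers[0])
    return [
        [exp(-bandwidth * squared_distance(a, b) / p) for b in markers]
        for a in markers
    ]

# Toy doubled-haploid genotypes coded 0/2 (three lines, four markers).
markers = [
    [0, 2, 2, 0],
    [0, 2, 0, 0],
    [2, 0, 0, 2],
]

K = gaussian_kernel(markers)
for row in K:
    print([round(value, 4) for value in row])
```

Lines that share more marker alleles receive kernel values closer to 1, so a kernel-based model borrows more information from genetically similar training individuals.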


2016 ◽  
Author(s):  
Heather E. Wheeler ◽  
Kaanan P. Shah ◽  
Jonathon Brenner ◽  
Tzintzuni Garcia ◽  
Keston Aquino-Michaels ◽  
...  

Abstract Understanding the genetic architecture of gene expression traits is key to elucidating the underlying mechanisms of complex traits. Here, for the first time, we perform a systematic survey of the heritability and the distribution of effect sizes across all representative tissues in the human body. We find that local h² can be relatively well characterized with 59% of expressed genes showing significant h² (FDR < 0.1) in the DGN whole blood cohort. However, current sample sizes (n ≤ 922) do not allow us to compute distal h². Bayesian Sparse Linear Mixed Model (BSLMM) analysis provides strong evidence that the genetic contribution to local expression traits is dominated by a handful of genetic variants rather than by the collective contribution of a large number of variants each of modest size. In other words, the local architecture of gene expression traits is sparse rather than polygenic across all 40 tissues (from DGN and GTEx) examined. This result is confirmed by the sparsity of optimal performing gene expression predictors via elastic net modeling. To further explore the tissue context specificity, we decompose the expression traits into cross-tissue and tissue-specific components using a novel Orthogonal Tissue Decomposition (OTD) approach. Through a series of simulations we show that the cross-tissue and tissue-specific components are identifiable via OTD. Heritability and sparsity estimates of these derived expression phenotypes show similar characteristics to the original traits. Consistent properties relative to prior GTEx multi-tissue analysis results suggest that these traits reflect the expected biology. Finally, we apply this knowledge to develop prediction models of gene expression traits for all tissues. The prediction models, heritability, and prediction performance R² for original and decomposed expression phenotypes are made publicly available (https://github.com/hakyimlab/PrediXcan).
Author Summary Gene regulation is known to contribute to the underlying mechanisms of complex traits. The GTEx project has generated RNA-Seq data on hundreds of individuals across more than 40 tissues providing a comprehensive atlas of gene expression traits. Here, we systematically examined the local versus distant heritability as well as the sparsity versus polygenicity of protein coding gene expression traits in tissues across the entire human body. To determine tissue context specificity, we decomposed the expression levels into cross-tissue and tissue-specific components. Regardless of tissue type, we found that local heritability, but not distal heritability, can be well characterized with current sample sizes. We found that the distribution of effect sizes is more consistent with a sparse local architecture in all tissues. We also show that the cross-tissue and tissue-specific expression phenotypes constructed with our orthogonal tissue decomposition model recapitulate complex Bayesian multi-tissue analysis results. This knowledge was applied to develop prediction models of gene expression traits for all tissues, which we make publicly available.


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Magali Jaillard ◽  
Mattia Palmieri ◽  
Alex van Belkum ◽  
Pierre Mahé

Abstract Background: Recent years have witnessed the development of several k-mer–based approaches aiming to predict phenotypic traits of bacteria on the basis of their whole-genome sequences. While often convincing in terms of predictive performance, the underlying models are in general not straightforward to interpret, the interplay between the actual genetic determinant and its translation as k-mers being generally hard to decipher. Results: We propose a simple and computationally efficient strategy allowing one to cope with the high correlation inherent to k-mer–based representations in supervised machine learning models, leading to concise and easily interpretable signatures. We demonstrate the benefit of this approach on the task of predicting the antibiotic resistance profile of a Klebsiella pneumoniae strain from its genome, where our method leads to signatures defined as weighted linear combinations of genetic elements that can easily be identified as genuine antibiotic resistance determinants, with state-of-the-art predictive performance. Conclusions: By enhancing the interpretability of genomic k-mer–based antibiotic resistance prediction models, our approach improves their clinical utility and hence will facilitate their adoption in routine diagnostics by clinicians and microbiologists. While antibiotic resistance was the motivating application, the method is generic and can be transposed to any other bacterial trait. An R package implementing our method is available at https://gitlab.com/biomerieux-data-science/clustlasso.
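The correlation problem this abstract refers to can be seen in a small sketch: overlapping k-mers drawn from the same genetic element tend to share exactly the same presence/absence pattern across genomes. The code below extracts k-mers from toy sequences and groups those with identical patterns. The sequences, the choice of k=3, and the function names are invented for illustration, and grouping by identical pattern is a deliberate simplification, not the method implemented in the clustlasso package.

```python
from collections import defaultdict

def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def group_identical_patterns(genomes, k):
    """Group k-mers that share the same presence/absence pattern across genomes.

    Each key is a tuple of 0/1 flags (one per genome); each value is the
    sorted list of k-mers showing that pattern.
    """
    kmer_sets = [kmers(genome, k) for genome in genomes]
    all_kmers = set().union(*kmer_sets)
    groups = defaultdict(list)
    for kmer in sorted(all_kmers):
        pattern = tuple(int(kmer in s) for s in kmer_sets)
        groups[pattern].append(kmer)
    return dict(groups)

# Toy "genomes" (assumed sequences, far shorter than real bacterial genomes).
genomes = ["ACGTAC", "ACGTTT", "GGGTAC"]

groups = group_identical_patterns(genomes, k=3)
for pattern, members in sorted(groups.items()):
    print(pattern, members)
```

Treating each group as a single feature removes perfectly redundant columns before model fitting, which is one simple way to make a sparse model's selected features easier to map back to genetic elements.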

