Combinatorial and statistical prediction of gene expression from haplotype sequence

2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i194-i202
Author(s):  
Berk A Alpay ◽  
Pinar Demetci ◽  
Sorin Istrail ◽  
Derek Aguiar

Abstract Motivation Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation. Results In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods. We show that variant and haplotype features selected by HAPLEXR are smaller in size than those of competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage-dependent and intragenic epistatic effects when predicting expression. Availability and implementation Source code and binaries are freely available at https://github.com/rapturous/HAPLEX. Supplementary information Supplementary data are available at Bioinformatics online.
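As an illustration of why haplotype information can matter beyond allelic dosage, the following minimal Python sketch computes per-site dosages from a pair of haplotypes. It is not the HAPLEXR or HAPLEXD implementation; the function name `allelic_dosage` and the two example individuals are hypothetical, and alleles are assumed to be coded 0 (reference) and 1 (alternate).

```python
from typing import List, Tuple

Haplotype = Tuple[int, ...]  # assumed coding: 0 = reference allele, 1 = alternate allele


def allelic_dosage(hap_a: Haplotype, hap_b: Haplotype) -> List[int]:
    """Per-site count of alternate alleles across the two haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return [a + b for a, b in zip(hap_a, hap_b)]


# Two hypothetical individuals, each typed at two biallelic sites.
cis = ((1, 1), (0, 0))    # both alternate alleles sit on the same haplotype
trans = ((1, 0), (0, 1))  # alternate alleles sit on opposite haplotypes

dosage_cis = allelic_dosage(*cis)
dosage_trans = allelic_dosage(*trans)

# Both individuals have dosage 1 at each site, although their haplotypes differ.
print(dosage_cis, dosage_trans)
```

The two individuals are indistinguishable to a dosage-only model, while a model that sees haplotypes can tell them apart.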

2020 ◽  
Vol 117 (26) ◽  
pp. 15028-15035 ◽  
Author(s):  
Ronald Yurko ◽  
Max G’Sell ◽  
Kathryn Roeder ◽  
Bernie Devlin

To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive P-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association P values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene–gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.


2019 ◽  
Author(s):  
Ronald Yurko ◽  
Max G’Sell ◽  
Kathryn Roeder ◽  
Bernie Devlin

Abstract To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive p-value thresholding (AdaPT; Lei & Fithian 2018), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association p-values play the role of the primary data for AdaPT; SNPs are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene-gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex (Werling et al. 2019). We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.


2017 ◽  
Author(s):  
Claudia Giambartolomei ◽  
Jimmy Zhenli Liu ◽  
Wen Zhang ◽  
Mads Hauberg ◽  
Huwenbo Shi ◽  
...  

Abstract Motivation Most genetic variants implicated in complex diseases by genome-wide association studies (GWAS) are non-coding, making it challenging to understand the causative genes involved in disease. Integrating external information such as quantitative trait locus (QTL) mapping of molecular traits (e.g., expression, methylation) is a powerful approach to identify the subset of GWAS signals explained by regulatory effects. In particular, expression QTLs (eQTLs) help pinpoint the responsible gene among the GWAS regions that harbor many genes, while methylation QTLs (mQTLs) help identify the epigenetic mechanisms that impact gene expression, which in turn affect disease risk. In this work we propose multiple-trait-coloc (moloc), a Bayesian statistical framework that integrates GWAS summary data with multiple molecular QTL data to identify regulatory effects at GWAS risk loci. Results We applied moloc to schizophrenia (SCZ) and eQTL/mQTL data derived from human brain tissue and identified 52 candidate genes that influence SCZ through methylation. Our method can be applied to any GWAS and relevant functional data to help prioritize disease-associated genes. Availability moloc is available for download as an R package (https://github.com/clagiamba/moloc). We also developed a web site to visualize the biological findings (icahn.mssm.edu/moloc). The browser allows searches by gene, methylation probe, and scenario of [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Chris Chatzinakos ◽  
Donghyung Lee ◽  
Bradley T Webb ◽  
Vladimir I Vladimirov ◽  
Kenneth S Kendler ◽  
...  

Abstract Motivation To increase detection power, researchers use gene-level analysis methods to aggregate weak marker signals. Because gene expression controls biological processes, researchers proposed aggregating signals for expression quantitative trait loci (eQTL). Most gene-level eQTL methods make statistical inferences based on i) summary statistics from genome-wide association studies (GWAS) and ii) linkage disequilibrium (LD) patterns from a relevant reference panel. While most such tools assume homogeneous cohorts, our Gene-level Joint Analysis of functional SNPs in Cosmopolitan Cohorts (JEPEGMIX) method accommodates cosmopolitan cohorts by using heterogeneous panels. However, JEPEGMIX relies on brain eQTLs from older gene expression studies and does not adjust for background enrichment in GWAS signals. Results We propose JEPEGMIX2, an extension of JEPEGMIX. When compared to JEPEGMIX, it uses i) cis-eQTL SNPs from the latest expression studies and ii) brain-specific (sub)tissues and tissues other than brain. JEPEGMIX2 also i) avoids accumulating averagely enriched polygenic information by adjusting for background enrichment and ii) to avoid an increase in false positive rates for studies with numerous highly enriched (above the background) genes, outputs gene q-values based on Holm adjustment of [email protected] Supplementary information Supplementary material is available at Bioinformatics online.
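For readers unfamiliar with the Holm adjustment mentioned above, the following Python sketch shows the standard Holm step-down procedure applied to a list of p-values. It is a generic illustration, not code from JEPEGMIX2; the function name `holm_adjust` and the example values are hypothetical.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three genes.
example = holm_adjust([0.01, 0.04, 0.03])
print(example)  # approximately [0.03, 0.06, 0.06]
```

The smallest p-value receives the full Bonferroni multiplier, and each subsequent one a smaller multiplier, which makes the procedure no more conservative than Bonferroni while still controlling the family-wise error rate.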


Author(s):  
Qiuming Yao ◽  
Paolo Ferragina ◽  
Yakir Reshef ◽  
Guillaume Lettre ◽  
Daniel E Bauer ◽  
...  

Abstract Motivation Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants, but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding of how this mechanism contributes to phenotype. Results We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated. Availability Motif-Raptor is freely available as a Python package at: https://github.com/pinellolab/MotifRaptor. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Nora Scherer ◽  
Peggy Sekula ◽  
Peter Pfaffelhuber ◽  
Pascal Schlosser

Abstract Motivation Genome-wide association studies conventionally use the additive genetic model to explore whether a single nucleotide polymorphism (SNP) is associated with a quantitative trait. However, for variants that do not follow an intermediate mode of inheritance (MOI), the recessive or the dominant genetic model can have more power to detect associations; furthermore, the MOI is important for downstream analyses and clinical interpretation. When multiple MOIs are modelled, the question arises of which best describes the true underlying MOI. Results We developed an R package that, for the first time, allows study-specific critical values to be determined for when one of the three models is more informative than the others for a quantitative trait locus. The package allows for user-friendly simulations to determine these critical values with predefined minor allele frequencies and study sizes. For application scenarios with extensive multiple testing, we integrated an interpolation functionality to determine critical values from only a moderate number of random draws. Availability and implementation The R package pgainsim is freely available for download on GitHub at https://github.com/genepi-freiburg/pgainsim. Supplementary information Supplementary data are available at Bioinformatics online.
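The three genetic models named in this abstract differ in how a genotype is turned into a regression predictor. The short Python sketch below shows one common coding convention, assuming genotypes are recorded as minor-allele counts (0, 1 or 2). It does not reproduce the pgainsim API; the function name `encode_genotype` is hypothetical.

```python
def encode_genotype(minor_allele_count: int, model: str) -> int:
    """Code a genotype (0, 1 or 2 minor alleles) under a given genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        # Each copy of the minor allele contributes equally.
        return minor_allele_count
    if model == "dominant":
        # One copy of the minor allele is enough for the full effect.
        return int(minor_allele_count >= 1)
    if model == "recessive":
        # Two copies of the minor allele are required for an effect.
        return int(minor_allele_count == 2)
    raise ValueError(f"unknown model: {model}")


genotypes = [0, 1, 2]
codings = {
    model: [encode_genotype(g, model) for g in genotypes]
    for model in ("additive", "dominant", "recessive")
}
print(codings)
# {'additive': [0, 1, 2], 'dominant': [0, 1, 1], 'recessive': [0, 0, 1]}
```

The models differ only in how heterozygotes are treated, which is why comparing their fit on the same data can hint at the underlying mode of inheritance.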


Gut ◽  
2021 ◽  
pp. gutjnl-2020-323906
Author(s):  
Jue-Sheng Ong ◽  
Jiyuan An ◽  
Xikun Han ◽  
Matthew H Law ◽  
Priyanka Nandakumar ◽  
...  

Objective Gastro-oesophageal reflux disease (GERD) has heterogeneous aetiology primarily attributable to its symptom-based definitions. GERD genome-wide association studies (GWASs) have shown strong genetic overlaps with established risk factors such as obesity and depression. We hypothesised that the shared genetic architecture between GERD and these risk factors can be leveraged to (1) identify new GERD and Barrett’s oesophagus (BE) risk loci and (2) explore potentially heterogeneous pathways leading to GERD and oesophageal complications. Design We applied multitrait GWAS models combining GERD (78 707 cases; 288 734 controls) and genetically correlated traits including education attainment, depression and body mass index. We also used multitrait analysis to identify BE risk loci. Top hits were replicated in 23andMe (462 753 GERD cases, 24 099 BE cases, 1 484 025 controls). We additionally dissected the GERD loci into obesity-driven and depression-driven subgroups. These subgroups were investigated to determine how they relate to tissue-specific gene expression and to risk of serious oesophageal disease (BE and/or oesophageal adenocarcinoma, EA). Results We identified 88 loci associated with GERD, with 59 replicating in 23andMe after multiple testing corrections. Our BE analysis identified seven novel loci. Additionally, we showed that only the obesity-driven GERD loci (but not the depression-driven loci) were associated with genes enriched in oesophageal tissues and successfully predicted BE/EA. Conclusion Our multitrait model identified many novel risk loci for GERD and BE. We present strong evidence for a genetic underpinning of disease heterogeneity in GERD and show that GERD loci associated with depressive symptoms are not strong predictors of BE/EA relative to obesity-driven GERD loci.


Metabolites ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 513
Author(s):  
Grace H. Yang ◽  
Danielle A. Fontaine ◽  
Sukanya Lodh ◽  
Joseph T. Blumer ◽  
Avtar Roopra ◽  
...  

Transcription factor 19 (TCF19) is a gene associated with type 1 diabetes (T1DM) and type 2 diabetes (T2DM) in genome-wide association studies. Prior studies have demonstrated that Tcf19 knockdown impairs β-cell proliferation and increases apoptosis. However, little is known about its role in diabetes pathogenesis or the effects of TCF19 gain-of-function. The aim of this study was to examine the impact of TCF19 overexpression in INS-1 β-cells and human islets on proliferation and gene expression. With TCF19 overexpression, there was an increase in nucleotide incorporation without any change in cell cycle gene expression, alluding to an alternate process of nucleotide incorporation. Analysis of RNA-seq of TCF19 overexpressing cells revealed increased expression of several DNA damage response (DDR) genes, as well as a tightly linked set of genes involved in viral responses, immune system processes, and inflammation. This connectivity between DNA damage and inflammatory gene expression has not been well studied in the β-cell and suggests a novel role for TCF19 in regulating these pathways. Future studies determining how TCF19 may modulate these pathways can provide potential targets for improving β-cell survival.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jamie W. Robinson ◽  
Richard M. Martin ◽  
Spiridon Tsavachidis ◽  
Amy E. Howell ◽  
Caroline L. Relton ◽  
...  

Abstract Genome-wide association studies (GWAS) have discovered 27 loci associated with glioma risk. Whether these loci are causally implicated in glioma risk, and how risk differs across tissues, has yet to be systematically explored. We integrated multi-tissue expression quantitative trait loci (eQTLs) and glioma GWAS data using a combined Mendelian randomisation (MR) and colocalisation approach. We investigated how genetically predicted gene expression affects risk across tissue type (brain, estimated effective n = 1194, and whole blood, n = 31,684) and glioma subtype (all glioma (7400 cases, 8257 controls), glioblastoma (GBM, 3112 cases) and non-GBM gliomas (2411 cases)). We also leveraged tissue-specific eQTLs collected from 13 brain tissues (n = 114 to 209). The MR and colocalisation results suggested that genetically predicted increased gene expression of 12 genes was associated with glioma, GBM and/or non-GBM risk, three of which are novel glioma susceptibility genes (RETREG2/FAM134A, FAM178B and MVB12B/FAM125B). The effect of gene expression appears to be relatively consistent across glioma subtype diagnoses. Examining how risk differed across 13 brain tissues highlighted five candidate tissues (cerebellum, cortex, and the putamen, nucleus accumbens and caudate basal ganglia) and four previously implicated genes (JAK1, STMN3, PICK1 and EGFR). These analyses identified robust causal evidence for 12 genes and glioma risk, three of which are novel. The correlation of MR estimates in brain and blood is consistently low, which suggests that tissue specificity needs to be carefully considered for glioma. Our results have implicated genes yet to be associated with glioma susceptibility and provided insight into putatively causal pathways for glioma risk.

