Integration of multidimensional splicing data and GWAS summary statistics for risk gene discovery

2021 ◽  
Author(s):  
Ying Ji ◽  
Qiang Wei ◽  
Rui Chen ◽  
Quan Wang ◽  
Ran Tao ◽  
...  

Abstract A common strategy for the functional interpretation of genome-wide association study (GWAS) findings has been the integrative analysis of GWAS and expression data. Using this strategy, many association methods (e.g., PrediXcan and FUSION) have been successful in identifying trait-associated genes via mediating effects on RNA expression. However, these approaches often ignore the effects of splicing, which carries as much disease risk as expression. Compared to expression data, one challenge in detecting associations using splicing data is the large multiple testing burden due to multidimensional splicing events within genes. Here, we introduce a multidimensional splicing gene (MSG) approach, which consists of two stages: 1) we use sparse canonical correlation analysis (sCCA) to construct latent canonical vectors (CVs) by identifying sparse linear combinations of genetic variants and splicing events that are maximally correlated with each other; and 2) we test for the association between the genetically regulated splicing CVs and the trait of interest using GWAS summary statistics. Simulations show that MSG has proper type I error control and substantial power gains over existing multidimensional expression analysis methods (i.e., S-MultiXcan, UTMOST, and sCCA+ACAT) under diverse scenarios. When applied to the Genotype-Tissue Expression Project data and GWAS summary statistics of 14 complex human traits, MSG identified on average 83%, 115%, and 223% more significant genes than sCCA+ACAT, S-MultiXcan, and UTMOST, respectively. We highlight MSG’s applications to Alzheimer’s disease, low-density lipoprotein cholesterol, and schizophrenia, and find that the majority of MSG-identified genes would have been missed by expression-based analyses. Our results demonstrate that aggregating splicing data through MSG can improve power in identifying gene-trait associations and help better understand the genetic risk of complex traits.

Author summary While genome-wide association studies (GWAS) have successfully mapped thousands of loci associated with complex traits, it remains difficult to identify which genes they regulate and in which biological contexts. This interpretation challenge has motivated the development of computational methods to prioritize causal genes at GWAS loci. Most available methods have focused on linking risk variants with differential gene expression. However, the genetic control of splicing and that of expression are comparable in their contribution to complex trait risk, and few studies have focused on identifying causal genes using splicing information. To study splicing-mediated effects, one important statistical challenge is the large multiple testing burden generated by multidimensional splicing events. In this study, we develop a new approach, MSG, to test the mediating role of splicing variation on complex traits. We integrate multidimensional splicing data using sparse canonical correlation analysis and then combine evidence for splicing-trait associations across features using a joint test. We show that this approach has higher power to identify causal genes using splicing data than current state-of-the-art methods designed to model multidimensional expression data. We illustrate the benefits of our approach through extensive simulations and applications to real data sets of 14 complex traits.
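The second stage described in the abstract above tests a genetically regulated feature against a trait using only GWAS summary statistics. A minimal sketch of the generic summary-statistic test used by methods of this family is given here: for SNP weights w, per-SNP GWAS z-scores z, and an LD (SNP correlation) matrix R, the feature-level statistic is z = w'z / sqrt(w'Rw). This is not MSG's joint test across canonical vectors, which the abstract does not specify. The function names, the weights, and the LD matrix are hypothetical, and the sketch assumes standardized genotypes and an LD matrix taken from a reference panel.

```python
import math


def summary_assoc_z(weights, gwas_z, ld):
    """Z-score for a weighted combination of SNPs from summary statistics.

    Computes z = (w' z) / sqrt(w' R w), where w are SNP weights, z are
    per-SNP GWAS z-scores and R is the SNP correlation (LD) matrix.
    Assumes standardized genotypes and an LD matrix from a reference panel.
    """
    n = len(weights)
    if len(gwas_z) != n or len(ld) != n or any(len(row) != n for row in ld):
        raise ValueError("weights, gwas_z and ld must have matching sizes")
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("w' R w must be positive")
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical inputs: three SNPs with sparse weights (one weight is zero).
cv_weights = [0.5, 0.0, -0.3]
snp_z = [3.0, 1.0, -2.0]
ld_matrix = [
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
z_cv = summary_assoc_z(cv_weights, snp_z, ld_matrix)
print(f"z = {z_cv:.3f}, p = {two_sided_p(z_cv):.2e}")
```

With several latent features per gene, one such z-score would be computed per feature and the results then combined. How they are combined is the part that differs between methods.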

Author(s):  
Zachary F Gerring ◽  
Angela Mina-Vargas ◽  
Eric R Gamazon ◽  
Eske M Derks

Abstract
Motivation: Genome-wide association studies have successfully identified multiple independent genetic loci that harbour variants associated with human traits and diseases, but the exact causal genes are largely unknown. Common genetic risk variants are enriched in non-protein-coding regions of the genome and often affect gene expression (expression quantitative trait loci, eQTL) in a tissue-specific manner. To address this challenge, we developed a methodological framework, E-MAGMA, which converts genome-wide association summary statistics into gene-level statistics by assigning risk variants to their putative genes based on tissue-specific eQTL information.
Results: We compared E-MAGMA to three eQTL-informed gene-based approaches using simulated phenotype data. Phenotypes were simulated based on eQTL reference data using GCTA for all genes with at least one eQTL on chromosome 1. We performed 10 simulations per gene. The eQTL-h2 (i.e., the proportion of variation explained by the eQTLs) was set at 1%, 2%, and 5%. We found that E-MAGMA outperforms other gene-based approaches across a range of simulated parameters (e.g. the number of identified causal genes). When applied to genome-wide association summary statistics for five neuropsychiatric disorders, E-MAGMA identified more putative candidate causal genes than other eQTL-based approaches. These results show that, by integrating tissue-specific eQTL information, E-MAGMA will help to identify novel candidate causal genes from genome-wide association summary statistics and thereby improve the understanding of the biological basis of complex disorders.
Availability: A tutorial and input files are made available in a GitHub repository: https://github.com/eskederks/eMAGMA-tutorial.
Supplementary information: Supplementary data are available at Bioinformatics online.


Biostatistics ◽  
2017 ◽  
Vol 18 (3) ◽  
pp. 477-494 ◽  
Author(s):  
Jakub Pecanka ◽  
Marianne A. Jonker ◽  
Zoltan Bochdanovits ◽  
Aad W. Van Der Vaart ◽  

Summary For over a decade functional gene-to-gene interaction (epistasis) has been suspected to be a determinant in the “missing heritability” of complex traits. However, searching for epistasis on the genome-wide scale has been challenging due to the prohibitively large number of tests which result in a serious loss of statistical power as well as computational challenges. In this article, we propose a two-stage method applicable to existing case-control data sets, which aims to lessen both of these problems by pre-assessing whether a candidate pair of genetic loci is involved in epistasis before it is actually tested for interaction with respect to a complex phenotype. The pre-assessment is based on a two-locus genotype independence test performed in the sample of cases. Only the pairs of loci that exhibit non-equilibrium frequencies are analyzed via a logistic regression score test, thereby reducing the multiple testing burden. Since only the computationally simple independence tests are performed for all pairs of loci while the more demanding score tests are restricted to the most promising pairs, genome-wide association study (GWAS) for epistasis becomes feasible. By design our method provides strong control of the type I error. Its favourable power properties especially under the practically relevant misspecification of the interaction model are illustrated. Ready-to-use software is available. Using the method we analyzed Parkinson’s disease in four cohorts and identified possible interactions within several SNP pairs in multiple cohorts.
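The screening stage of the two-stage design described above can be sketched as follows. The summary does not specify the form of the independence test, so this sketch substitutes a plain Pearson chi-square test on the 3x3 two-locus genotype table observed in cases. The function names, the pair labels, and the threshold `alpha` are hypothetical. Only pairs that pass the screen would go on to the logistic regression score test, which is not implemented here.

```python
import math


def chi2_independence_3x3(table):
    """Pearson chi-square statistic for a 3x3 table of genotype counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(3):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def chi2_sf_df4(x):
    """Survival function of a chi-square variable with 4 degrees of freedom.

    For even degrees of freedom the tail has a closed form; with df = 4 it is
    exp(-x/2) * (1 + x/2). Using df = 4 assumes no empty row or column.
    """
    return math.exp(-x / 2) * (1 + x / 2)


def screen_pairs(case_tables, alpha):
    """Return the locus pairs whose genotypes look dependent among cases."""
    return [
        pair
        for pair, table in case_tables.items()
        if chi2_sf_df4(chi2_independence_3x3(table)) < alpha
    ]


# Hypothetical case-only genotype tables (rows: locus A, columns: locus B).
case_tables = {
    ("snp1", "snp2"): [[10, 20, 10], [20, 40, 20], [10, 20, 10]],
    ("snp1", "snp3"): [[30, 5, 5], [5, 70, 5], [5, 5, 30]],
}
candidates = screen_pairs(case_tables, alpha=1e-4)
print(candidates)
```

Because the screen needs only cell counts, it is cheap enough to run on all pairs, while the costlier regression test is reserved for the few pairs that survive.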


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jisu Shin ◽  
Sang Hong Lee

Abstract Genetic variation in response to the environment, that is, genotype-by-environment interaction (GxE), is fundamental in the biology of complex traits and diseases. However, existing methods are computationally demanding and cannot feasibly handle biobank-scale data. Here, we introduce GxEsum, a method for estimating the phenotypic variance explained by genome-wide GxE based on GWAS summary statistics. Through comprehensive simulations and analysis of UK Biobank with 288,837 individuals, we show that GxEsum can handle a large-scale biobank dataset with controlled type I error rates and unbiased GxE estimates, and its computational efficiency can be hundreds of times higher than existing GxE methods.


2015 ◽  
Author(s):  
Guo-Bo Chen ◽  
Sang Hong Lee ◽  
Matthew R Robinson ◽  
Maciej Trzaskowski ◽  
Zhi-Xiang Zhu ◽  
...  

Genome-wide association studies (GWASs) have been successful in discovering replicable SNP-trait associations for many quantitative traits and common diseases in humans. Typically the effect sizes of SNP alleles are very small and this has led to large genome-wide association meta-analyses (GWAMA) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study we propose a new set of metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose a pair of methods for examining the concordance between demographic information and summary statistics. In method I, we use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. In method II, we conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. In addition, we propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. Finally, to quantify unknown sample overlap across all pairs of cohorts we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data, and does not breach individual privacy.
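Method I above compares cohorts through Fst computed from reported allele frequencies. The abstract does not state which Fst estimator is used, so the sketch below uses the textbook definition for two equally weighted populations, Fst = Var(p) / (p̄(1 − p̄)), combined across SNPs as a ratio of sums. It ignores sample-size corrections. The function name and the example frequencies are hypothetical.

```python
def fst_two_cohorts(freqs_a, freqs_b):
    """Fst between two cohorts from per-SNP reference allele frequencies.

    Uses Fst = sum(Var(p)) / sum(p_bar * (1 - p_bar)) with both cohorts
    weighted equally and no correction for finite sample size.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both cohorts must report the same SNPs")
    between = 0.0
    total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2
        between += (pa - pb) ** 2 / 4  # variance of p across the two cohorts
        total += p_bar * (1 - p_bar)
    if total == 0:
        raise ValueError("all SNPs are monomorphic in the pooled sample")
    return between / total


# Hypothetical reported allele frequencies for the same four SNPs.
cohort_x = [0.20, 0.55, 0.71, 0.10]
cohort_y = [0.22, 0.50, 0.69, 0.12]
cohort_z = [0.40, 0.30, 0.95, 0.35]
print(f"Fst(x, y) = {fst_two_cohorts(cohort_x, cohort_y):.4f}")
print(f"Fst(x, z) = {fst_two_cohorts(cohort_x, cohort_z):.4f}")
```

A matrix of such pairwise values across cohorts is the kind of input from which geographic structure and outlier cohorts could be read off.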


2020 ◽  
Vol 15 (11) ◽  
pp. 1643-1656
Author(s):  
Adrienne Tin ◽  
Anna Köttgen

The past few years have seen major advances in genome-wide association studies (GWAS) of CKD and kidney function–related traits in several areas: increases in sample size from >100,000 to >1 million, enabling the discovery of >250 associated genetic loci that are highly reproducible; the inclusion of participants not only of European but also of non-European ancestries; and the use of advanced computational methods to integrate additional genomic and other unbiased, high-dimensional data to characterize the underlying genetic architecture and prioritize potentially causal genes and variants. Together with other large-scale biobank and genetic association studies of complex traits, these GWAS of kidney function–related traits have also provided novel insight into the relationship of kidney function to other diseases with respect to their genetic associations, genetic correlation, and directional relationships. A number of studies also included functional experiments using model organisms or cell lines to validate prioritized potentially causal genes and/or variants. In this review article, we will summarize these recent GWAS of CKD and kidney function–related traits, explain approaches for downstream characterization of associated genetic loci and the value of such computational follow-up analyses, and discuss related challenges along with potential solutions to ultimately enable improved treatment and prevention of kidney diseases through genetics.


2021 ◽  
Author(s):  
Giulia Muzio ◽  
Leslie O'Bray ◽  
Laetitia Meng-Papaxanthos ◽  
Juliane Klatt ◽  
Karsten Borgwardt

While the search for associations between genetic markers and complex traits has discovered tens of thousands of trait-related genetic variants, the vast majority of these only explain a tiny fraction of observed phenotypic variation. One possible strategy to detect stronger associations is to aggregate the effects of several genetic markers and to test entire genes, pathways or (sub)networks of genes for association with a phenotype. The latter, network-based genome-wide association studies, in particular suffer from a huge search space and an inherent multiple testing problem. As a consequence, current approaches are either based on greedy feature selection, thereby risking that they miss relevant associations, and/or neglect multiple testing correction, which can lead to an abundance of false positive findings. To address the shortcomings of current approaches to network-based genome-wide association studies, we propose networkGWAS, a computationally efficient and statistically sound approach to gene-based genome-wide association studies based on mixed models and neighborhood aggregation. It allows for population structure correction and for well-calibrated p-values, which we obtain through a block permutation scheme. networkGWAS successfully detects known or plausible associations on simulated rare variants from H. sapiens data as well as semi-simulated and real data with common variants from A. thaliana and enables the systematic combination of gene-based genome-wide association studies with biological network information.


2017 ◽  
Author(s):  
Nicholas Mancuso ◽  
Gleb Kichaev ◽  
Huwenbo Shi ◽  
Malika Freund ◽  
Alexander Gusev ◽  
...  

Abstract Transcriptome-wide association studies (TWAS) using predicted expression have identified thousands of genes whose locally-regulated expression is associated with complex traits and diseases. In this work, we show that linkage disequilibrium (LD) among SNPs induces significant gene-trait associations at non-causal genes as a function of the overlap between eQTL weights used in expression prediction. We introduce a probabilistic framework that models the induced correlation among TWAS signals to assign to every gene in the risk region a probability of explaining the observed association signal while controlling for pleiotropic SNP effects and unmeasured causal expression. Importantly, our approach remains accurate when expression data for causal genes are not available in the causal tissue by leveraging expression prediction from other tissues. Our approach yields credible sets of genes containing the causal gene at a nominal confidence level (e.g., 90%) that can be used to prioritize and select genes for functional assays. We illustrate our approach using an integrative analysis of lipid traits where our approach prioritizes genes with strong evidence for causality.


2021 ◽  
Author(s):  
Xianghong Hu ◽  
Jia Zhao ◽  
Zhixiang Lin ◽  
Yang Wang ◽  
Heng Peng ◽  
...  

Abstract Mendelian Randomization (MR) has proved to be a powerful tool for inferring causal relationships among a wide range of traits using GWAS summary statistics. Great efforts have been made to relax MR assumptions to account for confounding due to pleiotropy. Here we show that sample structure, including population stratification, cryptic relatedness, and sample overlap, is another major confounding factor. We propose a unified MR approach, MR-APSS, to account for pleiotropy and sample structure simultaneously by leveraging genome-wide information. By further correcting bias in selecting genetic instruments, MR-APSS allows the inclusion of more genetic instruments with moderate effects to improve statistical power without inflating type I errors. We first evaluated MR-APSS using comprehensive simulations and negative controls, and then applied MR-APSS to study the causal relationships among a collection of diverse complex traits. The results suggest that MR-APSS can better identify plausible causal relationships with high reliability, in particular for highly polygenic traits.
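For orientation, the classical inverse-variance weighted (IVW) estimator shows how a causal effect is obtained from GWAS summary statistics in the simplest MR setting. It is the baseline that approaches such as the one above extend, not MR-APSS itself: the sketch applies no correction for pleiotropy, sample structure, or instrument selection bias. It assumes independent instruments and the fixed-effect form of IVW. The function name and the numbers are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate and standard error.

    estimate = sum(bx * by / se_y^2) / sum(bx^2 / se_y^2)
    se       = sqrt(1 / sum(bx^2 / se_y^2))
    Assumes independent instruments and ignores uncertainty in bx.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have the same length")
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("at least one instrument needs a non-zero effect")
    return numerator / denominator, math.sqrt(1 / denominator)


# Hypothetical summary statistics for three instruments; the outcome effects
# are exactly half the exposure effects, so the estimate should be 0.5.
bx = [0.1, 0.2, 0.3]
by = [0.05, 0.10, 0.15]
se_y = [0.01, 0.01, 0.02]
estimate, std_err = ivw_estimate(bx, by, se_y)
print(f"IVW estimate = {estimate:.3f} (SE {std_err:.3f})")
```

Any confounding that shifts the SNP-outcome effects, such as the sample structure discussed in the abstract, feeds directly into this ratio, which is why corrections are needed in practice.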


2019 ◽  
Author(s):  
César-Reyer Vroom ◽  
Christiaan de Leeuw ◽  
Danielle Posthuma ◽  
Conor V. Dolan ◽  
Sophie van der Sluis

Abstract The vast majority of genome-wide association (GWA) studies analyze a single trait while large-scale multivariate data sets are available. As complex traits are highly polygenic, and pleiotropy seems ubiquitous, it is essential to determine when multivariate association tests (MATs) outperform univariate approaches in terms of power. We discuss the statistical background of 19 MATs and give an overview of their statistical properties. We address the Type I error rates of these MATs and demonstrate which factors can cause bias. Finally, we examine, compare, and discuss the power of these MATs, varying the number of traits, the correlational pattern between the traits, the number of affected traits, and the sign of the genetic effects. Our results demonstrate under which circumstances specific MATs perform optimally. Through sharing of flexible simulation scripts, we facilitate a standard framework for comparing the Type I error rate and power of new MATs to those of existing ones.

