CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

Yi Yang; Xingjie Shi; Yuling Jiao; Jian Huang; Min Chen; Xiang Zhou; Lei Sun; Xinyi Lin; Can Yang; Jin Liu

doi:10.1093/bioinformatics/btz880

CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

Bioinformatics ◽

10.1093/bioinformatics/btz880 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2009-2016 ◽

Cited By ~ 6

Author(s):

Yi Yang ◽

Xingjie Shi ◽

Yuling Jiao ◽

Jian Huang ◽

Min Chen ◽

...

Keyword(s):

Gene Expression ◽

Genetic Variants ◽

Complex Traits ◽

Mixed Model ◽

Association Studies ◽

Gwas Data ◽

Supplementary Information ◽

Summary Statistics ◽

Individual Level ◽

The Relationship

Abstract Motivation Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) was proposed to jointly interrogate the genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required. Results In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data.
Availability and implementation The implementation of CoMM-S2 is included in the CoMM package that can be downloaded from https://github.com/gordonliu810822/CoMM. Supplementary information Supplementary data are available at Bioinformatics online.
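
As a reading aid, the two-model structure described in the abstract can be written schematically as below. This is a sketch in generic notation, not the authors' exact specification: the symbols (y for expression in the eQTL sample, X_1 and X_2 for genotypes in the eQTL and GWAS samples, z for the phenotype, gamma for variant effects on expression, alpha for the expression-trait effect) are assumptions introduced here, and covariates are omitted. The summary-statistics likelihood that CoMM-S2 substitutes for the individual-level GWAS term is not reproduced.

```latex
% Schematic two-model structure (notation assumed, covariates omitted)
\begin{align}
  \mathbf{y} &= \mathbf{X}_1 \boldsymbol{\gamma} + \mathbf{e}_1,
    & \mathbf{e}_1 &\sim \mathcal{N}(\mathbf{0}, \sigma_1^2 \mathbf{I})
    && \text{(expression on genotype)} \\
  \mathbf{z} &= \alpha\, \mathbf{X}_2 \boldsymbol{\gamma} + \mathbf{e}_2,
    & \mathbf{e}_2 &\sim \mathcal{N}(\mathbf{0}, \sigma_2^2 \mathbf{I})
    && \text{(phenotype on predicted expression)} \\
  \boldsymbol{\gamma} &\sim \mathcal{N}(\mathbf{0}, \sigma_\gamma^2 \mathbf{I}),
    & & H_0 : \alpha = 0 .
\end{align}
```

Because the same gamma appears in both equations, fitting them jointly carries the uncertainty of the expression model into the test of alpha, which is the property the abstract attributes to CoMM.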

CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

10.1101/652263 ◽

2019 ◽

Cited By ~ 2

Author(s):

Yi Yang ◽

Xingjie Shi ◽

Yuling Jiao ◽

Jian Huang ◽

Min Chen ◽

...

Keyword(s):

Gene Expression ◽

Genetic Variants ◽

Complex Traits ◽

Mixed Model ◽

Association Studies ◽

Gwas Data ◽

Supplementary Information ◽

Summary Statistics ◽

Individual Level ◽

The Relationship

Abstract Motivation Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) [42] was proposed to jointly interrogate the genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required. Results In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data.
Availability and implementation The implementation of CoMM-S2 is included in the CoMM package that can be downloaded from https://github.com/gordonliu810822/CoMM. Supplementary information Supplementary data are available at Bioinformatics online.

Motif-Raptor: A Cell Type-Specific and Transcription Factor Centric Approach for Post-GWAS Prioritization of Causal Regulators

Bioinformatics ◽

10.1093/bioinformatics/btab072 ◽

2021 ◽

Author(s):

Qiuming Yao ◽

Paolo Ferragina ◽

Yakir Reshef ◽

Guillaume Lettre ◽

Daniel E Bauer ◽

...

Keyword(s):

Gene Expression ◽

Genetic Variants ◽

Complex Traits ◽

Association Studies ◽

Cell Types ◽

Chromatin Accessibility ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Systematic Understanding ◽

Coding Variants

Abstract Motivation Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants, but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding of how this mechanism contributes to phenotype. Results We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs that have been previously characterized and validated. Availability Motif-Raptor is freely available as a Python package at: https://github.com/pinellolab/MotifRaptor. Supplementary information Supplementary data are available at Bioinformatics online.

TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits

10.1101/507525 ◽

2018 ◽

Cited By ~ 3

Author(s):

Sini Nagpal ◽

Xiaoran Meng ◽

Michael P. Epstein ◽

Lam C. Tsoi ◽

Matthew Patrick ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Bayesian Model ◽

Genetic Architecture ◽

Bayesian Method ◽

Association Studies ◽

Gwas Data ◽

Nonparametric Bayesian ◽

Transcriptomic Data ◽

Special Cases

Abstract Transcriptome-wide association studies (TWAS), which test for association between the study trait and the gene expression levels imputed from cis-acting expression quantitative trait loci (cis-eQTL) genotypes, have successfully enhanced the discovery of genetic risk loci for complex traits. By using gene expression imputation models fitted from reference datasets that have both genetic and transcriptomic data, TWAS facilitates gene-based tests with GWAS data while accounting for the reference transcriptomic data. Existing TWAS tools like PrediXcan and FUSION use parametric imputation models that have limitations for modeling the complex genetic architecture of transcriptomic data. Therefore, we propose an improved Bayesian method that assumes a data-driven nonparametric prior to impute gene expression. Our method is general and flexible and includes both the parametric imputation models used by PrediXcan and FUSION as special cases. Our simulation studies showed that the nonparametric Bayesian model improved both imputation R² for transcriptomic data and TWAS power over PrediXcan. In real applications, our nonparametric Bayesian method fitted transcriptomic imputation models for twice as many genes as PrediXcan, with 1.7 times the average regression R², thus improving the power of follow-up TWAS. Hence, the nonparametric Bayesian model is preferred for modeling the complex genetic architecture of transcriptomes and is expected to enhance transcriptome-integrated genetic association studies. We implement our Bayesian approach in a convenient software tool “TIGAR” (Transcriptome-Integrated Genetic Association Resource), which imputes transcriptomic data and performs subsequent TWAS using individual-level or summary-level GWAS data.

A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome

10.1101/563379 ◽

2019 ◽

Cited By ~ 2

Author(s):

Tom G Richardson ◽

Gibran Hemani ◽

Tom R Gaunt ◽

Caroline L Relton ◽

George Davey Smith

Keyword(s):

Gene Expression ◽

Genetic Variants ◽

Complex Traits ◽

Mendelian Randomization ◽

Drug Repositioning ◽

Association Studies ◽

Thyroid Tissue ◽

Genome Wide Association Studies ◽

Tissue Specific ◽

Genome Wide

Abstract Background Developing insight into tissue-specific transcriptional mechanisms can help improve our understanding of how genetic variants exert their effects on complex traits and disease. By applying the principles of Mendelian randomization, we have undertaken a systematic analysis to evaluate transcriptome-wide associations between gene expression across 48 different tissue types and 395 complex traits. Results Overall, we identified 100,025 gene-trait associations based on conventional genome-wide corrections (P < 5 × 10⁻⁸) that also provided evidence of genetic colocalization. These results indicated that genetic variants which influence gene expression levels in multiple tissues are more likely to influence multiple complex traits. We identified many examples of tissue-specific effects, such as genetically-predicted TPO, NR3C2 and SPATA13 expression only associating with thyroid disease in thyroid tissue. Additionally, FBN2 expression was associated with both cardiovascular and lung function traits, but only when analysed in heart and lung tissue respectively. We also demonstrate that conducting phenome-wide evaluations of our results can help flag adverse on-target side effects for therapeutic intervention, as well as propose drug repositioning opportunities. Moreover, we find that exploring the tissue-dependency of associations identified by genome-wide association studies (GWAS) can help elucidate the causal genes and tissues responsible for effects, as well as uncover putative novel associations. Conclusions The atlas of tissue-dependent associations we have constructed should prove extremely valuable to future studies investigating the genetic determinants of complex disease. The follow-up analyses we have performed in this study are merely a guide for future research. Conducting similar evaluations can be undertaken systematically at http://mrcieu.mrsoftware.org/Tissue_MR_atlas/.

CoMM: A Collaborative Mixed Model That Integrates GWAS and eQTL Data Sets to Investigate the Genetic Architecture of Complex Traits

Bioinformatics and Biology Insights ◽

10.1177/1177932219881435 ◽

2019 ◽

Vol 13 ◽

pp. 117793221988143 ◽

Cited By ~ 1

Author(s):

Kar-Fu Yeung ◽

Yi Yang ◽

Can Yang ◽

Jin Liu

Keyword(s):

Gene Expression ◽

Association Study ◽

Genetic Variants ◽

Complex Traits ◽

Mixed Model ◽

Genome Wide Association Study ◽

Data Sets ◽

Transcriptome Data ◽

Data Set ◽

Expression Levels

Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then ‘imputed’ using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM’s ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.
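
The two-stage procedure that this abstract attributes to PrediXcan-style methods (fit a predictive model for expression, impute expression in the GWAS sample, test the imputed values against the phenotype) can be sketched as follows. This is a minimal illustration on simulated data, not the implementation of PrediXcan, TWAS, or CoMM: ridge regression stands in for whichever prediction model a given tool uses, and every name, sample size, and effect size below is an assumption made for the example.

```python
import numpy as np


def fit_expression_weights(X_ref, expr_ref, ridge=1.0):
    """Stage 1: ridge regression of expression on genotypes in the reference panel."""
    p = X_ref.shape[1]
    A = X_ref.T @ X_ref + ridge * np.eye(p)
    return np.linalg.solve(A, X_ref.T @ expr_ref)


def assoc_with_imputed_expression(X_gwas, pheno, weights):
    """Stage 2: impute expression in the GWAS sample, then regress phenotype on it."""
    g = X_gwas @ weights
    g = g - g.mean()
    y = pheno - pheno.mean()
    beta = (g @ y) / (g @ g)
    resid = y - beta * g
    # Standard error of the slope in simple linear regression
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (g @ g))
    return beta, se, beta / se


# Simulated data (all values are illustrative assumptions)
rng = np.random.default_rng(0)
n_ref, n_gwas, p = 300, 2000, 20
gamma = rng.normal(0.0, 0.3, p)            # variant effects on expression
X_ref = rng.binomial(2, 0.3, (n_ref, p)).astype(float)
X_gwas = rng.binomial(2, 0.3, (n_gwas, p)).astype(float)
expr_ref = X_ref @ gamma + rng.normal(0.0, 1.0, n_ref)
alpha = 0.5                                # expression-trait effect
pheno = alpha * (X_gwas @ gamma) + rng.normal(0.0, 1.0, n_gwas)

# Centre the reference data before fitting so no intercept is needed
w = fit_expression_weights(X_ref - X_ref.mean(axis=0), expr_ref - expr_ref.mean())
beta, se, z = assoc_with_imputed_expression(X_gwas, pheno, w)
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f}")
```

Stage 2 treats the fitted weights as if they were known exactly, so the reported standard error ignores the estimation error from stage 1. That is the limitation the abstract says CoMM addresses by modelling both steps jointly.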

CoMM-S4: A Collaborative Mixed Model Using Summary-Level eQTL and GWAS Datasets in Transcriptome-Wide Association Studies

Frontiers in Genetics ◽

10.3389/fgene.2021.704538 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yi Yang ◽

Kar-Fu Yeung ◽

Jin Liu

Keyword(s):

Likelihood Ratio ◽

Genetic Variants ◽

Association Studies ◽

Ratio Test ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Expression Trait ◽

Individual Level ◽

Trait Association ◽

Eqtl Data

Motivation: Genome-wide association studies (GWAS) have achieved remarkable success in identifying SNP-trait associations in the last decade. However, it is challenging to identify the mechanisms that connect the genetic variants with complex traits as the majority of GWAS associations are in non-coding regions. Methods that integrate genomic and transcriptomic data allow us to investigate how genetic variants may affect a trait through their effect on gene expression. These include CoMM and CoMM-S2, likelihood-ratio-based methods that integrate GWAS and eQTL studies to assess expression-trait association. However, their reliance on individual-level eQTL data renders them inapplicable when only summary-level eQTL results, such as those from large-scale eQTL analyses, are available. Result: We develop an efficient probabilistic model, CoMM-S4, to explore the expression-trait association using summary-level eQTL and GWAS datasets. Compared with CoMM-S2, which uses individual-level eQTL data, CoMM-S4 requires only summary-level eQTL data. To test expression-trait association, an efficient variational Bayesian EM algorithm and a likelihood ratio test were constructed. We applied CoMM-S4 to both simulated and real data. The simulation results demonstrate that CoMM-S4 can perform as well as CoMM-S2 and S-PrediXcan, and analyses using GWAS summary statistics from Biobank Japan and eQTL summary statistics from eQTLGen and GTEx suggest novel susceptibility loci for cardiovascular diseases and osteoporosis. Availability and implementation: The developed R package is available at https://github.com/gordonliu810822/CoMM.

Mendelian randomization while jointly modeling cis genetics identifies causal relationships between gene expression and lipids

Nature Communications ◽

10.1038/s41467-020-18716-x ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Adriaan van der Graaf ◽

Annique Claringbould ◽

Antoine Rimbert ◽

Harm-Jan Westra ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Mendelian Randomization ◽

Low Density Lipoprotein ◽

Density Lipoprotein ◽

Summary Statistics ◽

Low Density Lipoprotein Cholesterol ◽

Causal Inferences ◽

Individual Level ◽

Level Data

Abstract Inference of causality between gene expression and complex traits using Mendelian randomization (MR) is confounded by pleiotropy and linkage disequilibrium (LD) of gene-expression quantitative trait loci (eQTL). Here, we propose an MR method, MR-link, that accounts for unobserved pleiotropy and LD by leveraging information from individual-level data, even when only one eQTL variant is present. In simulations, MR-link shows false-positive rates close to expectation (median 0.05) and high power (up to 0.89), outperforming all other tested MR methods and coloc. Application of MR-link to low-density lipoprotein cholesterol (LDL-C) measurements in 12,449 individuals with expression and protein QTL summary statistics from blood and liver identifies 25 genes causally linked to LDL-C. These include the known SORT1 and ApoE genes as well as PVRL2, located in the APOE locus, for which a causal role in liver was not known. Our results showcase the strength of MR-link for transcriptome-wide causal inferences.

An iterative approach to detect pleiotropy and perform Mendelian Randomization analysis using GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/btaa985 ◽

2020 ◽

Author(s):

Xiaofeng Zhu ◽

Xiaoyin Li ◽

Rong Xu ◽

Tao Wang

Keyword(s):

Complex Traits ◽

Mendelian Randomization ◽

Causal Effect ◽

Association Studies ◽

Real Data ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Causal Relationships ◽

Multiple Traits

Abstract Motivation The overall association evidence of a genetic variant with multiple traits can be evaluated by cross-phenotype association analysis using summary statistics from genome-wide association studies. Further dissecting the association pathways from a variant to multiple traits is important to understand the biological causal relationships among complex traits. Results Here, we introduce a flexible and computationally efficient Iterative Mendelian Randomization and Pleiotropy (IMRP) approach to simultaneously search for horizontal pleiotropic variants and estimate causal effect. Extensive simulations and real data applications suggest that IMRP has similar or better performance than existing Mendelian Randomization methods for both causal effect estimation and pleiotropic variant detection. The developed pleiotropy test is further extended to detect colocalization for multiple variants at a locus. IMRP will greatly facilitate our understanding of causal relationships underlying complex traits, in particular, when a large number of genetic instrumental variables are used for evaluating multiple traits. Availability and implementation The software IMRP is available at https://github.com/XiaofengZhuCase/IMRP. The simulation code can be downloaded at http://hal.case.edu/~xxz10/zhu-web/ under the link: MR Simulations software. Supplementary information Supplementary data are available at Bioinformatics online.

Estimating genetic nurture with summary statistics of multigenerational genome-wide association studies

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2023184118 ◽

2021 ◽

Vol 118 (25) ◽

pp. e2023184118

Author(s):

Yuchang Wu ◽

Xiaoyuan Zhong ◽

Yunong Lin ◽

Zijie Zhao ◽

Jiawen Chen ◽

...

Keyword(s):

Complex Traits ◽

Association Studies ◽

Genetic Correlations ◽

Genetic Effects ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Phenotypic Data ◽

Individual Level ◽

Indirect Genetic Effects ◽

Genome Wide

Marginal effect estimates in genome-wide association studies (GWAS) are mixtures of direct and indirect genetic effects. Existing methods to dissect these effects require family-based, individual-level genetic and phenotypic data with large samples, which are difficult to obtain in practice. Here, we propose a statistical framework to estimate direct and indirect genetic effects using summary statistics from GWAS conducted on own and offspring phenotypes. Applied to birth weight, our method produced results nearly identical to those obtained using individual-level data. We also decomposed direct and indirect genetic effects of educational attainment (EA), which showed distinct patterns of genetic correlations with 45 complex traits. The known genetic correlations between EA and higher height, lower body mass index, less-active smoking behavior, and better health outcomes were mostly explained by the indirect genetic component of EA. In contrast, the consistently identified genetic correlation of autism spectrum disorder (ASD) with higher EA resides in the direct genetic component. A polygenic transmission disequilibrium test showed a significant overtransmission of the direct component of EA from healthy parents to ASD probands. Taken together, we demonstrate that traditional GWAS approaches, in conjunction with offspring phenotypic data collection in existing cohorts, could greatly benefit studies on genetic nurture and shed important light on the interpretation of genetic associations for human complex traits.

Assumptions about frequency-dependent architectures of complex traits bias measures of functional enrichment

10.1101/2020.10.23.352427 ◽

2020 ◽

Author(s):

Shadi Zabad ◽

Aaron P. Ragsdale ◽

Rosie Sun ◽

Yue Li ◽

Simon Gravel

Keyword(s):

Linkage Disequilibrium ◽

Genetic Variants ◽

Complex Traits ◽

Statistical Difference ◽

Functional Enrichment ◽

Frequency Effect ◽

Summary Statistics ◽

Biological Interpretation ◽

Implicit And Explicit ◽

The Relationship

Abstract Linkage-Disequilibrium Score Regression (LDSC) is a popular framework for analyzing GWAS summary statistics that allows for estimating SNP heritability, confounding, and functional enrichment of genetic variants with different annotations. Recent work has highlighted the influence of implicit and explicit assumptions of the model on the biological interpretation of the results. In this work, we explored a formulation of LDSC that replaces the r² measure of LD with a recently-proposed unbiased estimator of the D² statistic. In addition to modest statistical differences across estimators, this derivation highlighted implicit and unrealistic assumptions about the relationship between allele frequency, effect size, and annotation status. We carry out a systematic comparison of alternative LDSC formulations by applying them to summary statistics from 47 GWAS traits. Our results show that commonly used models likely underestimate functional enrichment. These results highlight the importance of calibrating the LDSC model to achieve a more robust understanding of polygenic traits.
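
For orientation, the standard univariate LD score regression relation that this abstract builds on is commonly written as below. The notation is assumed here rather than taken from the preprint: N is the GWAS sample size, M the number of variants, h² the SNP heritability, a the confounding term, and ℓ_j the LD score of variant j. The D²-based reformulation studied in the preprint is not reproduced.

```latex
% Standard LDSC relation (notation assumed; not the preprint's D^2 variant)
\begin{equation}
  \mathbb{E}\!\left[\chi_j^2\right] = \frac{N h^2}{M}\,\ell_j + N a + 1,
  \qquad
  \ell_j = \sum_{k} r_{jk}^2 .
\end{equation}
```

In this form, the slope of the regression of chi-square statistics on LD scores estimates heritability and the intercept absorbs confounding. The choice of LD measure inside ℓ_j is the component the preprint varies.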
