Integrating Comprehensive Functional Annotations to Boost Power and Accuracy in Gene-Based Association Analysis

Integrating comprehensive functional annotations to boost power and accuracy in gene-based association analysis

PLoS Genetics ◽

10.1371/journal.pgen.1009060 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1009060

Author(s):

Corbin Quick ◽

Xiaoquan Wen ◽

Gonçalo Abecasis ◽

Michael Boehnke ◽

Hyun Min Kang

Keyword(s):

Association Analysis ◽

Association Studies ◽

Early Gene ◽

Genome Wide Association Studies ◽

Functional Annotations ◽

Regulatory Variants ◽

Causal Genes ◽

And Performance ◽

The Uk ◽

Coding Variants

Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.
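The general idea of combining annotation-specific gene-based tests from GWAS summary statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' framework or tool: it assumes per-variant Z-scores and an LD (correlation) matrix are available, uses a simple weighted-sum (burden-style) statistic for each annotation, and swaps in an equal-weight Cauchy combination as a stand-in omnibus test. The function names, weights, and numbers are all hypothetical.

```python
import math
import numpy as np

def burden_z(z, ld, weights):
    """Gene-level Z-score from a weighted sum of per-variant Z-scores.

    Under the null, w'z has variance w' R w, where R is the LD matrix.
    """
    z = np.asarray(z, dtype=float)
    w = np.asarray(weights, dtype=float)
    ld = np.asarray(ld, dtype=float)
    var = float(w @ ld @ w)
    if var <= 0.0:
        raise ValueError("weights give zero variance; no variants selected")
    return float(w @ z) / math.sqrt(var)

def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def cauchy_combine(pvals):
    """Equal-weight Cauchy combination of p-values (assumed omnibus rule)."""
    p = np.asarray(pvals, dtype=float)
    t = float(np.mean(np.tan((0.5 - p) * math.pi)))
    return 0.5 - math.atan(t) / math.pi

# Hypothetical gene with three variants.
z_scores = [2.1, 1.8, 0.3]
ld_matrix = [[1.0, 0.2, 0.0],
             [0.2, 1.0, 0.1],
             [0.0, 0.1, 1.0]]

# Hypothetical annotation weights: one coding variant, two eQTL variants.
annotation_weights = {
    "coding": [1.0, 0.0, 0.0],
    "eqtl":   [0.0, 0.7, 0.3],
}

p_by_annotation = {
    name: two_sided_p(burden_z(z_scores, ld_matrix, w))
    for name, w in annotation_weights.items()
}
p_omnibus = cauchy_combine(list(p_by_annotation.values()))
print(p_by_annotation, p_omnibus)
```

Each annotation yields its own gene-level p-value, and the omnibus value summarizes them in a single test per gene; the choice of per-annotation statistic and combination rule here is an assumption made for illustration.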

Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via 412,394 UK Biobank exome sequences

10.1101/2021.12.24.21268381 ◽

2021 ◽

Author(s):

Abhishek Nag ◽

Lawrence Middleton ◽

Ryan S Dhindsa ◽

Dimitrios Vitsios ◽

Eleanor M Wigmore ◽

...

Keyword(s):

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Low Frequency ◽

Genome Wide Association Studies ◽

UK Biobank ◽

Protein Coding ◽

The Uk ◽

Metabolic Biomarkers ◽

Coding Variants

Genome-wide association studies have established the contribution of common and low-frequency variants to metabolic biomarkers in the UK Biobank (UKB); however, the role of rare variants remains to be assessed systematically. We evaluated rare coding variants for 198 metabolic biomarkers, including metabolites assayed by Nightingale Health, using exome sequencing in participants from four genetically diverse ancestries in the UKB (N=412,394). Gene-level collapsing analysis, which evaluated a range of genetic architectures, identified a total of 1,303 significant relationships between genes and metabolic biomarkers (p < 1×10⁻⁸), encompassing 207 distinct genes. These include associations between rare non-synonymous variants in GIGYF1 and glucose and lipid biomarkers, SYT7 and creatinine, and others, which may provide insights into novel disease biology. Compared with a previous microarray-based genotyping study in the same cohort, we observed that 40% of gene-biomarker relationships identified in the collapsing analysis were novel. Finally, we applied Gene-SCOUT, a novel tool that utilises the gene-biomarker association statistics from the collapsing analysis to identify genes having similar biomarker fingerprints and thus expand our understanding of gene networks.

Predicting target genes of non-coding regulatory variants with IRT

Bioinformatics ◽

10.1093/bioinformatics/btaa254 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4440-4448 ◽

Cited By ~ 1

Author(s):

Zhenqin Wu ◽

Nilah M Ioannidis ◽

James Zou

Keyword(s):

Target Genes ◽

Cross Validation ◽

Learning Algorithm ◽

Association Studies ◽

Characteristic Curve ◽

GC Content ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Regulatory Variants ◽

Coding Variants

Summary: Interpreting genetic variants of unknown significance (VUS) is essential in clinical applications of genome sequencing for diagnosis and personalized care. Non-coding variants remain particularly difficult to interpret, despite making up a large majority of trait associations identified in genome-wide association studies (GWAS). Predicting the regulatory effects of non-coding variants on candidate genes is a key step in evaluating their clinical significance. Here, we develop a machine-learning algorithm, Inference of Connected expression quantitative trait loci (eQTLs) (IRT), to predict the regulatory targets of non-coding variants identified in studies of eQTLs. We assemble datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learn to separate positive and negative pairs based on annotations characterizing the variant, gene and the intermediate sequence. IRT achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 using random cross-validation, and 0.700 for a more stringent position-based cross-validation. Further evaluation on rare variants and experimentally validated regulatory variants shows a significant enrichment in IRT identifying the true target genes versus negative controls. In gene-ranking experiments, IRT achieves a top-1 accuracy of 50% and top-3 accuracy of 90%. Salient features, including GC-content, histone modifications and Hi-C interactions, are further analyzed and visualized to illustrate their influences on predictions. IRT can be applied to any VUS of interest and each candidate nearby gene to output a score reflecting the likelihood of regulatory effect on the expression level. These scores can be used to prioritize variants and genes to assist in patient diagnosis and GWAS follow-up studies.
Availability and implementation: Codes and data used in this work are available at https://github.com/miaecle/eQTL_Trees.
Supplementary information: Supplementary data are available at Bioinformatics online.
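The top-1 and top-3 accuracies quoted in the abstract above are gene-ranking metrics: for each variant, candidate genes are ranked by score, and a hit is counted when the true target appears among the first k. A minimal sketch of that metric, assuming per-variant score dictionaries; the helper names, variant and gene labels, and scores are hypothetical and are not taken from the IRT code base:

```python
def rank_genes(scores):
    """Return candidate genes sorted from highest to lowest score."""
    return [gene for gene, _ in
            sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

def top_k_accuracy(scores_by_variant, true_target, k):
    """Fraction of variants whose true target gene is ranked in the top k."""
    hits = 0
    for variant, scores in scores_by_variant.items():
        if true_target[variant] in rank_genes(scores)[:k]:
            hits += 1
    return hits / len(scores_by_variant)

# Hypothetical scores for two variants, each with three nearby genes.
scores_by_variant = {
    "var1": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1},
    "var2": {"GENE_D": 0.2, "GENE_E": 0.7, "GENE_F": 0.5},
}
true_target = {"var1": "GENE_A", "var2": "GENE_F"}

print(top_k_accuracy(scores_by_variant, true_target, k=1))
print(top_k_accuracy(scores_by_variant, true_target, k=3))
```

With a small candidate set per variant, top-k accuracy rises quickly with k, so it is best read alongside the number of candidate genes considered per variant.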

Genetics of rheumatoid arthritis: 2018 status

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2018-213678 ◽

2018 ◽

Vol 78 (4) ◽

pp. 446-453 ◽

Cited By ~ 37

Author(s):

Yukinori Okada ◽

Stephen Eyre ◽

Akari Suzuki ◽

Yuta Kochi ◽

Kazuhiko Yamamoto

Keyword(s):

Rheumatoid Arthritis ◽

Association Studies ◽

Open Chromatin ◽

Genome Wide Association Studies ◽

Genetic Components ◽

Risk Variants ◽

Causal Genes ◽

Causal Variants ◽

Resolution Mapping ◽

Coding Variants

Study of the genetics of rheumatoid arthritis (RA) began about four decades ago with the discovery of HLA-DRB1. Since the beginning of this century, a number of non-HLA risk loci have been identified through genome-wide association studies (GWAS). We now know that over 100 loci are associated with RA risk. Because genetic information implies a clear causal relationship to the disease, research into the pathogenesis of RA should be promoted. However, only 20% of GWAS loci contain coding variants, with the remaining variants occurring in non-coding regions, and therefore, the majority of causal genes and causal variants remain to be identified. The use of epigenetic studies, high-resolution mapping of open chromatin, chromosomal conformation technologies and other approaches could identify many of the missing links between genetic risk variants and causal genetic components, thus expanding our understanding of RA genetics.

GARFIELD - GWAS Analysis of Regulatory or Functional Information Enrichment with LD correction

10.1101/085738 ◽

2016 ◽

Cited By ~ 17

Author(s):

Valentina Iotchkova ◽

Graham R.S. Ritchie ◽

Matthias Geihs ◽

Sandro Morganella ◽

Josine L. Min ◽

...

Keyword(s):

Association Studies ◽

R Package ◽

Genome Wide Association Studies ◽

Protein Coding ◽

Functional Annotations ◽

Novel Approach ◽

Genome Wide ◽

Functional Consequences ◽

Genomic Regions ◽

Coding Variants

Loci discovered by genome-wide association studies (GWAS) predominantly map outside protein-coding genes. The interpretation of functional consequences of non-coding variants can be greatly enhanced by catalogs of regulatory genomic regions in cell lines and primary tissues. However, robust and readily applicable methods are still lacking to systematically evaluate the contribution of these regions to genetic variation implicated in diseases or quantitative traits. Here we propose a novel approach that leverages GWAS findings with regulatory or functional annotations to classify features relevant to a phenotype of interest. Within our framework, we account for major sources of confounding that current methods do not address. We further assess enrichment statistics for 27 GWAS traits within regulatory regions from the ENCODE and Roadmap projects. We characterise unique enrichment patterns for traits and annotations, driving novel biological insights. The method is implemented in standalone software and an R package to facilitate its application by the research community.

Integrating co-expression networks with GWAS to prioritize causal genes in maize

10.1101/221655 ◽

2017 ◽

Cited By ~ 2

Author(s):

Robert J. Schaefer ◽

Jean-Michel Michno ◽

Joseph Jeffers ◽

Owen Hoekenga ◽

Brian Dilkes ◽

...

Keyword(s):

Candidate Genes ◽

Large Scale ◽

Association Studies ◽

Strong Dependence ◽

Genome Wide Association Studies ◽

Functional Annotations ◽

Maize Seeds ◽

Causal Genes ◽

Mutant Phenotypes ◽

Molecular Components

Background: Genome-wide association studies (GWAS) have identified thousands of loci linked to hundreds of traits in many different species. However, because linkage disequilibrium implicates a broad region surrounding each identified locus, the causal genes often remain unknown. This problem is especially pronounced in non-human, non-model species where functional annotations are sparse and there is frequently little information available for prioritizing candidate genes.
Results: To address this issue, we developed a computational approach called Camoco (Co-Analysis of Molecular Components) that systematically integrates loci identified by GWAS with gene co-expression networks to prioritize putative causal genes. We applied Camoco to prioritize candidate genes from a large-scale GWAS examining the accumulation of 17 different elements in maize seeds. Camoco identified statistically significant subnetworks for the majority of traits examined, producing a prioritized list of high-confidence causal genes for several agronomically important maize traits. Two candidate genes identified by our approach were validated through analysis of mutant phenotypes. Strikingly, we observed a strong dependence in the performance of our approach on the type of co-expression network used: expression variation across genetically diverse individuals in a relevant tissue context (in our case, maize roots) outperformed other alternatives.
Conclusions: Our study demonstrates that co-expression networks can provide a powerful basis for prioritizing candidate causal genes from GWAS loci, but suggests that the success of such strategies can highly depend on the gene expression data context. Both the Camoco software and the lessons on integrating GWAS data with co-expression networks generalize to species beyond maize.

Rare variants in neuronal excitability genes influence risk for bipolar disorder

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1424958112 ◽

2015 ◽

Vol 112 (11) ◽

pp. 3576-3581 ◽

Cited By ~ 93

Author(s):

Seth A. Ament ◽

Szabolcs Szelinger ◽

Gustavo Glusman ◽

Justin Ashworth ◽

Liping Hou ◽

...

Keyword(s):

Bipolar Disorder ◽

Candidate Genes ◽

Neuronal Excitability ◽

Rare Variants ◽

Association Studies ◽

Missense Variant ◽

Genome Wide Association Studies ◽

Regulatory Variants ◽

Genes Encoding ◽

Coding Variants

We sequenced the genomes of 200 individuals from 41 families multiply affected with bipolar disorder (BD) to identify contributions of rare variants to genetic risk. We initially focused on 3,087 candidate genes with known synaptic functions or prior evidence from genome-wide association studies. BD pedigrees had an increased burden of rare variants in genes encoding neuronal ion channels, including subunits of GABA-A receptors and voltage-gated calcium channels. Four uncommon coding and regulatory variants also showed significant association, including a missense variant in GABRA6. Targeted sequencing of 26 of these candidate genes in an additional 3,014 cases and 1,717 controls confirmed rare variant associations in ANK3, CACNA1B, CACNA1C, CACNA1D, CACNG2, CAMK2A, and NGF. Variants in promoters and 5′ and 3′ UTRs contributed more strongly than coding variants to risk for BD, both in pedigrees and in the case-control cohort. The genes and pathways identified in this study regulate diverse aspects of neuronal excitability. We conclude that rare variants in neuronal excitability genes contribute to risk for BD.

sumSTAAR: a flexible framework for gene-based association studies using GWAS summary statistics

10.1101/2021.10.25.465680 ◽

2021 ◽

Author(s):

Nadezhda M Belonogova ◽

Gulnara R Svishcheva ◽

Anatoly V Kirichenko ◽

Yakov A Tsepilov ◽

Tatiana I Axenovich

Keyword(s):

Association Analysis ◽

Complex Traits ◽

Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Functional Annotations ◽

Mapping Tool ◽

Genome Wide ◽

Fixed Set ◽

Project Data

Gene-based association analysis is an effective gene mapping tool. Many gene-based methods have been proposed recently. However, their power depends on the underlying genetic architecture, which is rarely known in complex traits, and so it is likely that a combination of such methods could serve as a universal approach. Several frameworks combining different gene-based methods have been developed. However, they all imply a fixed set of methods, weights and functional annotations. Moreover, most of them use individual phenotypes and genotypes as input data. Here, we introduce sumSTAAR, a framework for gene-based association analysis using summary statistics obtained from genome-wide association studies (GWAS). It is an extended and modified version of the STAAR framework proposed by Li and colleagues in 2020. The sumSTAAR framework offers a wider range of gene-based methods to combine. It allows the user to arbitrarily define a set of these methods, weighting functions and probabilities of genetic variants being causal. The methods used in the framework were adapted to analyse genes with a large number of SNPs to decrease the running time. The framework includes a polygene pruning procedure to guard against the influence of strong GWAS signals outside the gene. We also present new, improved matrices of correlations between the genotypes of variants within genes. These matrices, estimated on a sample of 265,000 individuals, are a state-of-the-art replacement for the widely used matrices based on the 1000 Genomes Project data.

Screening for functional transcriptional and splicing regulatory variants with GenIE

Nucleic Acids Research ◽

10.1093/nar/gkaa960 ◽

2020 ◽

Vol 48 (22) ◽

pp. e131-e131

Author(s):

Sarah E Cooper ◽

Jeremy Schwartzentruber ◽

Erica Bello ◽

Eve L Coomber ◽

Andrew R Bassett

Keyword(s):

Genome Engineering ◽

Association Studies ◽

Screening Method ◽

Regulatory Elements ◽

Genome Wide Association Studies ◽

Regulatory Variants ◽

Causal Genes ◽

Induced Pluripotent ◽

Credible Set ◽

The Impact

Genome-wide association studies (GWAS) have identified numerous genetic loci underlying human diseases, but a fundamental challenge remains to accurately identify the underlying causal genes and variants. Here, we describe an arrayed CRISPR screening method, Genome engineering-based Interrogation of Enhancers (GenIE), which assesses the effects of defined alleles on transcription or splicing when introduced in their endogenous genomic locations. We use this sensitive assay to validate the activity of transcriptional enhancers and splice regulatory elements in human induced pluripotent stem cells (hiPSCs), and develop a software package (rgenie) to analyse the data. We screen the 99% credible set of Alzheimer's disease (AD) GWAS variants identified at the clusterin (CLU) locus to identify a subset of likely causal variants, and employ GenIE to understand the impact of specific mutations on splicing efficiency. We thus establish GenIE as an efficient tool to rapidly screen for the role of transcribed variants on gene expression.

Regulatory variants explain much more heritability than coding variants across 11 common diseases

10.1101/004309 ◽

2014 ◽

Cited By ~ 5

Author(s):

Alexander Gusev ◽

S Hong Lee ◽

Benjamin M Neale ◽

Gosia Trynka ◽

Bjarni J Vilhjalmsson ◽

...

Keyword(s):

Significant Contribution ◽

Association Studies ◽

Cell Types ◽

Genome Wide Association Studies ◽

Functional Categories ◽

Regulatory Variants ◽

Common Diseases ◽

Genome Wide ◽

Cell Type Specific ◽

Coding Variants

Common variants implicated by genome-wide association studies (GWAS) of complex diseases are known to be enriched for coding and regulatory variants. We applied methods to partition the heritability explained by genotyped SNPs (h²g) across functional categories (while accounting for shared variance due to linkage disequilibrium) to genotype and imputed data for 11 common diseases. DNaseI Hypersensitivity Sites (DHS) from 218 cell-types, spanning 16% of the genome, explained an average of 79% of h²g (5.1× enrichment; P < 10⁻²⁰); further enrichment was observed at enhancer and cell-type specific DHS elements. The enrichments were much smaller in analyses that did not use imputed data or were restricted to GWAS-associated SNPs. In contrast, coding variants, spanning 1% of the genome, explained only 8% of h²g (13.8× enrichment; P = 5 × 10⁻⁴). We replicated these findings but found no significant contribution from rare coding variants in an independent schizophrenia cohort genotyped on GWAS and exome chips.
