Predicting target genes of non-coding regulatory variants with ICE

2020 ◽  
Vol 36 (16) ◽  
pp. 4440-4448 ◽  
Author(s):  
Zhenqin Wu ◽  
Nilah M Ioannidis ◽  
James Zou

Abstract Summary Interpreting genetic variants of unknown significance (VUS) is essential in clinical applications of genome sequencing for diagnosis and personalized care. Non-coding variants remain particularly difficult to interpret, despite making up a large majority of the trait associations identified in genome-wide association study (GWAS) analyses. Predicting the regulatory effects of non-coding variants on candidate genes is a key step in evaluating their clinical significance. Here, we develop a machine-learning algorithm, Inference of Connected expression quantitative trait loci (eQTLs) (ICE), to predict the regulatory targets of non-coding variants identified in studies of eQTLs. We assemble datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learn to separate positive and negative pairs based on annotations characterizing the variant, the gene and the intermediate sequence. ICE achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 using random cross-validation, and 0.700 for a more stringent position-based cross-validation. Further evaluation on rare variants and experimentally validated regulatory variants shows a significant enrichment in ICE identifying the true target genes versus negative controls. In gene-ranking experiments, ICE achieves a top-1 accuracy of 50% and a top-3 accuracy of 90%. Salient features, including GC-content, histone modifications and Hi-C interactions, are further analyzed and visualized to illustrate their influence on predictions. ICE can be applied to any VUS of interest and each candidate nearby gene to output a score reflecting the likelihood of a regulatory effect on the expression level. These scores can be used to prioritize variants and genes to assist in patient diagnosis and GWAS follow-up studies. Availability and implementation Code and data used in this work are available at https://github.com/miaecle/eQTL_Trees. 
Supplementary information Supplementary data are available at Bioinformatics online.
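
The gene-ranking evaluation mentioned in the abstract above (top-1 and top-3 accuracy) can be illustrated with a minimal sketch. The variant and gene names, the scores, and the helper `top_k_accuracy` are all hypothetical and are not taken from the authors' repository; the sketch only shows how such a metric is commonly computed from per-variant candidate-gene scores.

```python
from typing import Dict


def top_k_accuracy(
    scores: Dict[str, Dict[str, float]],
    true_gene: Dict[str, str],
    k: int,
) -> float:
    """Fraction of variants whose true target gene is among the k best-scored candidates."""
    hits = 0
    for variant, gene_scores in scores.items():
        # Rank candidate genes for this variant from highest to lowest score.
        ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
        if true_gene[variant] in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy, made-up scores for two variants and their nearby candidate genes.
scores = {
    "var1": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1},
    "var2": {"GENE_D": 0.3, "GENE_E": 0.6, "GENE_F": 0.5},
}
true_gene = {"var1": "GENE_A", "var2": "GENE_F"}

top1 = top_k_accuracy(scores, true_gene, k=1)
top2 = top_k_accuracy(scores, true_gene, k=2)
print(top1, top2)
```

In this toy case the true gene is ranked first for one variant and second for the other, so top-1 accuracy is lower than top-2 accuracy.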

2019 ◽  
Author(s):  
Moli Huang ◽  
Yunpeng Wang ◽  
Manqiu Yang ◽  
Jun Yan ◽  
Henry Yang ◽  
...  

Abstract Summary Cancer hallmarks rely on their specific transcriptional programs, which are dysregulated by multiple mechanisms, including genomic aberrations in DNA regulatory regions. Genome-wide association studies have shown that many variants are found within putative enhancer elements. To provide insights into the regulatory role of enhancer-associated non-coding variants in the cancer epigenome, and to facilitate the identification of functional non-coding mutations, we present dbInDel, a database in which we have comprehensively analyzed enhancer-associated insertion and deletion variants for both human and murine samples using ChIP-Seq data. Moreover, we provide the identification and visualization of upstream TF binding motifs in InDel-containing enhancers. Downstream target genes are also predicted and analyzed in the context of cancer biology. The dbInDel database promotes the investigation of functional contributions of non-coding variants in the cancer epigenome. Availability and implementation The database, dbInDel, can be accessed from http://enhancer-indel.cam-su.org/. Supplementary information Supplementary data are available at Bioinformatics online.


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257265
Author(s):  
Seung-Soo Kim ◽  
Adam D. Hudgins ◽  
Jiping Yang ◽  
Yizhou Zhu ◽  
Zhidong Tu ◽  
...  

Type 1 diabetes (T1D) is an organ-specific autoimmune disease, whereby immune cell-mediated killing leads to loss of the insulin-producing β cells in the pancreas. Genome-wide association studies (GWAS) have identified over 200 genetic variants associated with risk for T1D. The majority of the GWAS risk variants reside in the non-coding regions of the genome, suggesting that gene regulatory changes substantially contribute to T1D. However, identification of causal regulatory variants associated with T1D risk and their affected genes is challenging due to incomplete knowledge of non-coding regulatory elements and the cellular states and processes in which they function. Here, we performed a comprehensive integrated post-GWAS analysis of T1D to identify functional regulatory variants in enhancers and their cognate target genes. Starting with 1,817 candidate T1D SNPs defined from the GWAS catalog and LDlink databases, we conducted functional annotation analysis using genomic data from various public databases. These include 1) Roadmap Epigenomics, ENCODE, and RegulomeDB for epigenome data; 2) GTEx for tissue-specific gene expression and expression quantitative trait loci data; and 3) lncRNASNP2 for long non-coding RNA data. Our results indicated a prevalent enhancer-based immune dysregulation in T1D pathogenesis. We identified 26 high-probability causal enhancer SNPs associated with T1D, and 64 predicted target genes. The majority of the target genes play major roles in antigen presentation and immune response and are regulated through complex transcriptional regulatory circuits, including those in HLA (6p21) and non-HLA (16p11.2) loci. These candidate causal enhancer SNPs are supported by strong evidence and warrant functional follow-up studies.


2017 ◽  
Vol 242 (13) ◽  
pp. 1325-1334 ◽  
Author(s):  
Yizhou Zhu ◽  
Cagdas Tazearslan ◽  
Yousin Suh

Genome-wide association studies have shown that the vast majority of disease-associated variants reside in the non-coding regions of the genome, suggesting that gene regulatory changes contribute to disease risk. Identifying truly causal non-coding variants and their affected target genes remains challenging but is a critical step in translating genetic associations into molecular mechanisms and, ultimately, clinical applications. Here we review genomic/epigenomic resources and in silico tools that can be used to identify causal non-coding variants, as well as experimental strategies to validate their function. Impact statement Most signals from genome-wide association studies (GWASs) map to the non-coding genome, and functional interpretation of these associations remains challenging. We review recent progress in methodologies for studying the non-coding genome and argue that no single approach allows one to effectively identify the causal regulatory variants from GWAS results. By illustrating the advantages and limitations of each method, our review potentially provides a guideline for taking a combinatorial approach to accurately predict, prioritize, and eventually experimentally validate the causal variants.


PLoS Genetics ◽  
2020 ◽  
Vol 16 (12) ◽  
pp. e1009060
Author(s):  
Corbin Quick ◽  
Xiaoquan Wen ◽  
Gonçalo Abecasis ◽  
Michael Boehnke ◽  
Hyun Min Kang

Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.
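
As a rough illustration of how a gene-based test can aggregate variant-level GWAS summary statistics with annotation weights, the sketch below computes a generic weighted sum of z-scores standardized by the LD correlation matrix. This is a textbook-style construction chosen for illustration, not the framework presented in the paper; the function names, weights, z-scores and LD values are all assumptions.

```python
import math
from typing import Sequence


def weighted_gene_z(
    z: Sequence[float],
    weights: Sequence[float],
    ld: Sequence[Sequence[float]],
) -> float:
    """Weighted sum of variant z-scores, standardized by its null variance w' R w."""
    n = len(z)
    numerator = sum(weights[i] * z[i] for i in range(n))
    # Under the null, z ~ N(0, R), so the weighted sum has variance w' R w.
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    return numerator / math.sqrt(variance)


def two_sided_p(z_score: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))


# Made-up example: two variants in one gene, equal annotation weights, LD r = 0.5.
variant_z = [2.0, 1.0]
annotation_weights = [1.0, 1.0]
ld_matrix = [[1.0, 0.5], [0.5, 1.0]]

gene_z = weighted_gene_z(variant_z, annotation_weights, ld_matrix)
gene_p = two_sided_p(gene_z)
print(round(gene_z, 3), round(gene_p, 3))
```

Changing the weights (for example, up-weighting variants with a tissue-specific regulatory annotation) changes which variants dominate the gene-level statistic, which is the intuition behind annotation-informed gene-based tests.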


2020 ◽  
Author(s):  
Sébastian Méric de Bellefon ◽  
Florian Thibord ◽  
Paul L. Auer ◽  
John Blangero ◽  
Zeynep H Coban-Akdemir ◽  
...  

Abstract Motivation Whole-genome DNA sequencing (WGS) enables the discovery of non-coding variants, but tools are lacking to prioritize the subset that functionally impacts human phenotypes. DNA sequence variants that disrupt or create transcription factor binding sites (TFBS) can modulate gene expression. find-tfbs efficiently scans phased WGS in large cohorts to identify and count TFBSs in regulatory sequences. This information can then be used in association testing to find putatively functional non-coding variants associated with complex human diseases or traits. Results We applied find-tfbs to discover functional non-coding variants associated with hematological traits in the NHLBI Trans-Omics for Precision Medicine (TOPMed) WGS dataset (Nmax = 44,709). We identified >2000 associations at P < 1 × 10⁻⁹, implicating specific blood cell-types, transcription factors and causal genes. The vast majority of these associations are captured by variants identified in large genome-wide association studies (GWAS) for blood-cell traits. find-tfbs is computationally efficient and robust, allowing for the rapid identification of non-coding variants associated with multiple human phenotypes in very large samples. Availability https://github.com/Helkafen/find-tfbs and https://github.com/Helkafen/[email protected] and [email protected] Supplementary information Supplementary data are available.
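
The idea of counting binding sites on each haplotype, so that a variant that creates or disrupts a site changes the count, can be sketched as follows. This is a simplified illustration and not the find-tfbs implementation: the toy position weight matrix, the threshold, the sequences and the function `count_sites` are all hypothetical, and only the forward strand is scanned.

```python
from typing import Dict, List

# Toy log-odds position weight matrix for a 3-bp motif that prefers "TGA".
# Values are invented for illustration only.
TOY_PWM: List[Dict[str, float]] = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
]


def count_sites(seq: str, pwm: List[Dict[str, float]], threshold: float) -> int:
    """Count windows on the forward strand whose PWM score reaches the threshold."""
    width = len(pwm)
    count = 0
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if score >= threshold:
            count += 1
    return count


# Two made-up haplotypes of the same regulatory region, differing by one G>C variant.
ref_haplotype = "CCTGACC"
alt_haplotype = "CCTCACC"

ref_count = count_sites(ref_haplotype, TOY_PWM, threshold=3.0)
alt_count = count_sites(alt_haplotype, TOY_PWM, threshold=3.0)
print(ref_count, alt_count)
```

Per-individual site counts of this kind (summed over both haplotypes) are the sort of quantity that could then be used as a predictor in association testing.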


2015 ◽  
Vol 112 (11) ◽  
pp. 3576-3581 ◽  
Author(s):  
Seth A. Ament ◽  
Szabolcs Szelinger ◽  
Gustavo Glusman ◽  
Justin Ashworth ◽  
Liping Hou ◽  
...  

We sequenced the genomes of 200 individuals from 41 families multiply affected with bipolar disorder (BD) to identify contributions of rare variants to genetic risk. We initially focused on 3,087 candidate genes with known synaptic functions or prior evidence from genome-wide association studies. BD pedigrees had an increased burden of rare variants in genes encoding neuronal ion channels, including subunits of GABAA receptors and voltage-gated calcium channels. Four uncommon coding and regulatory variants also showed significant association, including a missense variant in GABRA6. Targeted sequencing of 26 of these candidate genes in an additional 3,014 cases and 1,717 controls confirmed rare variant associations in ANK3, CACNA1B, CACNA1C, CACNA1D, CACNG2, CAMK2A, and NGF. Variants in promoters and 5′ and 3′ UTRs contributed more strongly than coding variants to risk for BD, both in pedigrees and in the case-control cohort. The genes and pathways identified in this study regulate diverse aspects of neuronal excitability. We conclude that rare variants in neuronal excitability genes contribute to risk for BD.


2021 ◽  
Vol 12 ◽  
Author(s):  
Wei Wang ◽  
Fengju Song ◽  
Xiangling Feng ◽  
Xinlei Chu ◽  
Hongji Dai ◽  
...  

Identifying causal regulatory variants and their target genes from the majority of non-coding disease-associated genetic loci is the main challenge in post-Genome-Wide Association Studies (GWAS) functional studies. Although chromosome conformation capture (3C) and its derivative technologies have been successfully applied to nominate putative causal genes for non-coding variants, many GWAS target genes have not yet been identified. This study generated a high-resolution contact map from epithelial ovarian cancer (EOC) cells with two H3K27ac-HiChIP libraries and analyzed the underlying gene networks for 15 risk loci identified from the largest EOC GWAS. By combined analysis of 4,021 fine-mapped credible variants from the EOC GWAS and the high-resolution contact map, we obtained 162 target genes that were mainly enriched in cancer-related pathways. Compared with GTEx eQTL genes in ovarian tissue and annotated proximal genes, 132 HiChIP targets were newly identified for EOC causal variants. More than half of the credible variants (CVs) were involved in interactions spanning more than 185 kb, indicating that long-range transcriptional regulation is an important mechanism for the function of GWAS variants in EOC. We also found that many HiChIP gene targets showed significantly differential expression between normal ovarian and EOC tumor samples. We validated one of these targets by manipulating the region containing rs9303542 with CRISPR-Cas9 deletion and dCas9-VP64 activation experiments and found altered expression of HOXB7 and HOXB8 at 17q21.32. This study presents a systematic analysis to identify putative target genes for causal variants of EOC, providing an in-depth investigation of the mechanisms of non-coding regulatory variants in the etiology and pathogenesis of ovarian cancer.
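
One common way to nominate target genes from a chromatin contact map is to look for loops in which one anchor contains the variant and the other anchor overlaps a gene promoter. The sketch below illustrates that logic under stated assumptions; it is not the authors' pipeline, and the coordinates, gene names and helper functions are invented for the example.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def contains(anchor: Interval, chrom: str, pos: int) -> bool:
    """True if the position falls inside the anchor interval."""
    return anchor[0] == chrom and anchor[1] <= pos < anchor[2]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def target_genes(
    variant: Tuple[str, int],
    loops: List[Tuple[Interval, Interval]],
    promoters: Dict[str, Interval],
) -> Set[str]:
    """Genes whose promoter overlaps the loop anchor opposite the variant's anchor."""
    chrom, pos = variant
    genes: Set[str] = set()
    for left, right in loops:
        # A loop is symmetric, so test the variant against both anchors.
        for near, far in ((left, right), (right, left)):
            if contains(near, chrom, pos):
                genes.update(g for g, p in promoters.items() if overlaps(far, p))
    return genes


# Made-up loop, promoters and variants (coordinates are not real).
loops = [(("chr1", 1000, 2000), ("chr1", 250000, 251000))]
promoters = {
    "GENE_X": ("chr1", 250500, 250900),
    "GENE_Y": ("chr1", 5000, 5500),
}

linked = target_genes(("chr1", 1500), loops, promoters)
unlinked = target_genes(("chr1", 3000), loops, promoters)
print(linked, unlinked)
```

In this toy example the linked variant sits roughly 249 kb from the promoter it contacts, the kind of long-range interaction that proximity-based gene assignment would miss.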


2019 ◽  
Author(s):  
Corbin Quick ◽  
Xiaoquan Wen ◽  
Gonçalo Abecasis ◽  
Michael Boehnke ◽  
Hyun Min Kang

Abstract Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.


Author(s):  
Qiuming Yao ◽  
Paolo Ferragina ◽  
Yakir Reshef ◽  
Guillaume Lettre ◽  
Daniel E Bauer ◽  
...  

Abstract Motivation Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants, but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding of how this mechanism contributes to phenotype. Results We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs that have been previously characterized and validated. Availability Motif-Raptor is freely available as a Python package at: https://github.com/pinellolab/MotifRaptor. Supplementary information Supplementary data are available at Bioinformatics online.


2014 ◽  
Author(s):  
Alexander Gusev ◽  
S Hong Lee ◽  
Benjamin M Neale ◽  
Gosia Trynka ◽  
Bjarni J Vilhjalmsson ◽  
...  

Common variants implicated by genome-wide association studies (GWAS) of complex diseases are known to be enriched for coding and regulatory variants. We applied methods to partition the heritability explained by genotyped SNPs (h²g) across functional categories (while accounting for shared variance due to linkage disequilibrium) to genotype and imputed data for 11 common diseases. DNaseI Hypersensitivity Sites (DHS) from 218 cell-types, spanning 16% of the genome, explained an average of 79% of h²g (5.1× enrichment; P < 10⁻²⁰); further enrichment was observed at enhancer and cell-type specific DHS elements. The enrichments were much smaller in analyses that did not use imputed data or were restricted to GWAS-associated SNPs. In contrast, coding variants, spanning 1% of the genome, explained only 8% of h²g (13.8× enrichment; P = 5 × 10⁻⁴). We replicated these findings but found no significant contribution from rare coding variants in an independent schizophrenia cohort genotyped on GWAS and exome chips.
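
Enrichment figures of this kind are typically a ratio: the share of SNP heritability explained by a functional category divided by the share of the genome (or of SNPs) that the category covers. The short sketch below applies that ratio to the rounded percentages quoted in the abstract. The helper `enrichment` is hypothetical, and the results are not expected to reproduce the reported 5.1× and 13.8× exactly, presumably because the published values were computed from unrounded inputs.

```python
def enrichment(heritability_share: float, genome_share: float) -> float:
    """Fold enrichment: share of heritability explained per share of genome covered."""
    return heritability_share / genome_share


# Rounded percentages as quoted in the abstract (assumed inputs, not the exact data).
dhs_fold = enrichment(0.79, 0.16)     # DHS: 79% of heritability in 16% of the genome
coding_fold = enrichment(0.08, 0.01)  # coding: 8% of heritability in 1% of the genome

print(f"DHS: {dhs_fold:.2f}x, coding: {coding_fold:.2f}x")
```

A fold enrichment of 1 would mean a category explains exactly as much heritability as its size predicts; values above 1 indicate that the category carries a disproportionate share.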

