A semisupervised model to predict regulatory effects of genetic variants at single nucleotide resolution using massively parallel reporter assays

Author(s):  
Zikun Yang ◽  
Chen Wang ◽  
Stephanie Erjavec ◽  
Lynn Petukhova ◽  
Angela Christiano ◽  
...  

Abstract Motivation: Predicting regulatory effects of genetic variants is a challenging but important problem in functional genomics. Given the relatively low sensitivity of functional assays, and the pervasiveness of class imbalance in functional genomic data, popular statistical prediction models can sharply underestimate the probability of a regulatory effect. We describe here the presence-only model (PO-EN), a type of semisupervised model, to predict regulatory effects of genetic variants at sequence-level resolution in a context of interest by integrating a large number of epigenetic features and massively parallel reporter assays (MPRAs). Results: Using experimental data from a variety of MPRAs, we show that the presence-only model produces better calibrated predicted probabilities and has increased accuracy relative to state-of-the-art prediction models. Furthermore, we show that the predictions based on pretrained PO-EN models are useful for prioritizing functional variants among candidate eQTLs and significant SNPs at GWAS loci. In particular, for the costimulatory locus, associated with multiple autoimmune diseases, we show evidence of a regulatory variant residing in an enhancer 24.4 kb downstream of CTLA4, with evidence from capture Hi-C of interaction with CTLA4. Furthermore, the risk allele of the regulatory variant is on the same risk-increasing haplotype as a functional coding variant in exon 1 of CTLA4, suggesting that the regulatory variant acts jointly with the coding variant to increase disease risk. Availability and implementation: The presence-only model is implemented in the R package ‘PO.EN’, freely available on CRAN. A vignette with a detailed demonstration of the proposed PO-EN model can be found on GitHub at https://github.com/Iuliana-Ionita-Laza/PO.EN/ Supplementary information: Supplementary data are available at Bioinformatics online.

2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i194-i202
Author(s):  
Berk A Alpay ◽  
Pinar Demetci ◽  
Sorin Istrail ◽  
Derek Aguiar

Abstract Motivation: Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation. Results: In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm, which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm, which models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods. We show that the variant and haplotype feature sets selected by HAPLEXR are smaller than those of competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage dependent and intragenic epistatic effects when predicting expression. Availability and implementation: Source code and binaries are freely available at https://github.com/rapturous/HAPLEX. Supplementary information: Supplementary data are available at Bioinformatics online.


PLoS ONE ◽  
2019 ◽  
Vol 14 (6) ◽  
pp. e0218073 ◽  
Author(s):  
Rajiv Movva ◽  
Peyton Greenside ◽  
Georgi K. Marinov ◽  
Surag Nair ◽  
Avanti Shrikumar ◽  
...  

2019 ◽  
Vol 15 ◽  
pp. P628-P628
Author(s):  
Karen Nuytemans ◽  
Derek J. van Booven ◽  
Natalia K. Hofmann ◽  
Farid Rajabli ◽  
Anthony J. Griswold ◽  
...  

2018 ◽  
Author(s):  
Rajiv Movva ◽  
Peyton Greenside ◽  
Georgi K. Marinov ◽  
Surag Nair ◽  
Avanti Shrikumar ◽  
...  

Abstract The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
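The abstract above summarizes model quality with a Spearman rank correlation between predicted and measured activity. As a minimal sketch of how that statistic is computed, the Python below ranks both vectors (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the toy activity values are hypothetical and are not taken from MPRA-DragoNN or the Sharpr-MPRA data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical toy values, for illustration only.
predicted_activity = [0.2, 1.1, 0.7, 0.7, 1.9]
measured_activity = [0.1, 0.9, 1.4, 0.3, 2.2]
rho = spearman_rho(predicted_activity, measured_activity)
```

Because the statistic depends only on ranks, it is insensitive to any monotone rescaling of the predictions, which is one reason it is often preferred for noisy assay readouts.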


2022 ◽  
Vol 8 ◽  
Author(s):  
Senlin Hu ◽  
Dong Hu ◽  
Haoran Wei ◽  
Shi-yang Li ◽  
Dong Wang ◽  
...  

Background: Recent genome-wide association studies identified genetic variants in scavenger receptor class B type 1 (SCARB1) that influence high-density lipoprotein cholesterol (HDL-C) and coronary heart disease (CHD) risk. Further study of potential functional variants in SCARB1 may provide new insight into the complicated relationship between HDL-C and CHD. Methods: A 2000 bp segment of the SCARB1 promoter region was resequenced in 168 participants with extremely high plasma HDL-C and 400 control subjects. Putative risk alleles were identified using bioinformatics analysis and reporter-gene assays. Two indel variants, rs144334493 and rs557348251, were genotyped in 5,002 CHD patients and 5,175 control subjects. The underlying mechanisms were investigated. Results: Resequencing identified 27 genetic variants. Genotyping in 5,002 CHD patients and 5,175 control subjects revealed that rs144334493 and rs557348251 were significantly associated with increased risk of CHD [odds ratio (OR): 1.28, 95% confidence interval (CI): 1.09–1.52, p = 0.003; OR: 2.65, 95% CI: 1.66–4.24, p = 4.4 × 10⁻⁵, respectively]. Subsequent mechanistic experiments demonstrated that the rs144334493 deletion allele attenuated forkhead box A1 (FOXA1) binding to the promoter region of SCARB1, whereas FOXA1 overexpression increased SR-BI expression. Conclusion: Genetic variants in the SCARB1 promoter region are significantly associated with plasma lipid levels through their effect on SR-BI expression and contribute to susceptibility to CHD.
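The results above are reported as odds ratios with 95% confidence intervals. As a rough sketch of where such numbers come from, the Python below computes an odds ratio and a Wald-type interval from a 2×2 table of carrier counts. The counts are hypothetical (the abstract does not give them), the function name is an assumption, and the study itself may well have used a different estimator, such as logistic regression with covariates.

```python
import math
from typing import Tuple


def odds_ratio_ci(
    case_carriers: int,
    case_noncarriers: int,
    control_carriers: int,
    control_noncarriers: int,
    z: float = 1.96,  # normal quantile for an approximate 95% interval
) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) from a 2x2 table.

    Uses the standard error of log(OR), sqrt(1/a + 1/b + 1/c + 1/d),
    which requires every cell count to be positive.
    """
    counts = (case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    a, b, c, d = counts
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only: 20 of 100 cases and
# 10 of 100 controls carry the variant.
or_value, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
```

An interval that excludes 1 corresponds to a nominally significant association at the matching level; in this toy table the lower bound falls just below 1, so the association would not reach significance despite an odds ratio above 2.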


2020 ◽  
Author(s):  
Minjun Park ◽  
Salvi Singh ◽  
Francisco Jose Grisanti Canozo ◽  
Md. Abul Hassan Samee

Abstract Massively parallel reporter assays (MPRAs) have enabled the study of transcriptional regulatory mechanisms at an unprecedented scale and with high quantitative resolution. However, this realm lacks models that can discover sequence-specific signals de novo from the data and integrate them in a mechanistic way. We present MuSeAM (Multinomial CNNs for Sequence Activity Modeling), a convolutional neural network that overcomes this gap. MuSeAM utilizes multinomial convolutions that directly model sequence-specific motifs of protein-DNA binding. We demonstrate that MuSeAM fits MPRA data with high accuracy and generalizes over other tasks such as predicting chromatin accessibility and prioritizing potentially functional variants.
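One way to picture a convolution filter that directly models a binding motif is as a position-specific probability matrix slid along the sequence, scoring each window by its log-odds against a background nucleotide frequency. The Python below is a sketch of that general idea only; it is not the MuSeAM implementation, and the motif, the uniform background, and the function name are all assumptions made for illustration.

```python
import math
from typing import Dict, List

# Hypothetical 2-position motif: each position is a multinomial
# distribution over the four nucleotides (probabilities sum to 1).
TOY_MOTIF: List[Dict[str, float]] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def log_odds_scan(
    sequence: str,
    motif: List[Dict[str, float]],
    background: float = 0.25,  # assumed uniform background frequency
) -> List[float]:
    """Score every window of the sequence against the motif.

    Each score is the sum over motif positions of
    log(P(base | motif position) / background), i.e. a log-likelihood
    ratio of "motif" versus "background" for that window.
    """
    width = len(motif)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            math.log(motif[pos][base] / background)
            for pos, base in enumerate(window)
        )
        scores.append(score)
    return scores


# The window "AC" at offset 1 matches the toy motif best.
scores = log_odds_scan("GACT", TOY_MOTIF)
best_offset = scores.index(max(scores))
```

Keeping each filter position a proper probability distribution is what makes the learned weights readable as a motif, in contrast to unconstrained convolution weights that need post hoc interpretation.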


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Bernard Mulvey ◽  
Joseph D. Dougherty

Abstract Family and population studies indicate clear heritability of major depressive disorder (MDD), though its underlying biology remains unclear. The majority of single-nucleotide polymorphism (SNP) linkage blocks associated with MDD by genome-wide association studies (GWASes) are believed to alter transcriptional regulators (e.g., enhancers, promoters) based on enrichment of marks correlated with these functions. A key to understanding MDD pathophysiology will be elucidation of which SNPs are functional and how such functional variants biologically converge to elicit the disease. Furthermore, retinoids can elicit MDD in patients and promote depressive-like behaviors in rodent models, acting via a regulatory system of retinoid receptor transcription factors (TFs). We therefore sought to simultaneously identify functional genetic variants and assess retinoid pathway regulation of MDD risk loci. Using Massively Parallel Reporter Assays (MPRAs), we functionally screened over 1000 SNPs prioritized from 39 neuropsychiatric trait/disease GWAS loci, selecting SNPs based on overlap with predicted regulatory features, including expression quantitative trait loci (eQTL) and histone marks, from human brains and cell cultures. We identified >100 SNPs with allelic effects on expression in a retinoid-responsive model system. Functional SNPs were enriched for binding sequences of retinoic acid-receptive transcription factors (TFs), with additional allelic differences unmasked by treatment with all-trans retinoic acid (ATRA). Finally, motifs overrepresented across functional SNPs corresponded to TFs highly specific to serotonergic neurons, suggesting an in vivo site of action. Our application of MPRAs to screen MDD-associated SNPs suggests a shared transcriptional-regulatory program across loci, a component of which is unmasked by retinoids.


2019 ◽  
Author(s):  
Jiyeon Choi ◽  
Tongwu Zhang ◽  
Andrew Vu ◽  
Julien Ablain ◽  
Matthew M Makowski ◽  
...  

Abstract Genome-wide association studies (GWAS) have identified ∼20 melanoma susceptibility loci. To identify susceptibility genes and variants simultaneously from multiple GWAS loci, we integrated massively parallel reporter assays (MPRA) with cell type-specific epigenomic data as well as melanocyte-specific expression quantitative trait loci (eQTL) profiling. Starting from 16 melanoma loci, we selected 832 variants overlapping active regions of chromatin in cells of melanocytic lineage and identified 39 candidate functional variants displaying allelic transcriptional activity by MPRA. For four of these loci, we further identified four colocalizing melanocyte cis-eQTL genes (CTSS, CASP8, MX2, and MAFF) matching the allelic activity of MPRA functional variants. Among these, we further characterized the locus encompassing the HIV-1 restriction gene, MX2, on chromosome band Chr21q22.3 and validated a functional variant, rs398206, among multiple high LD variants. rs398206 mediates allelic transcriptional activity via binding of the transcription factor, YY1. This allelic transcriptional regulation is consistent with a significant cis-eQTL of MX2 in primary human melanocytes, where the melanoma risk-associated A allele of rs398206 is correlated with higher MX2 levels. Melanocyte-specific transgenic expression of human MX2 in a zebrafish model demonstrated accelerated melanoma formation in a BRAFV600E background. Thus, using an efficient scalable approach to streamline GWAS follow-up functional studies, we identified multiple candidate melanoma susceptibility genes and variants, and uncovered a pleiotropic function of MX2 in melanoma susceptibility.


2021 ◽  
Author(s):  
Bernard Mulvey ◽  
Joseph D. Dougherty

Abstract Family and population studies indicate clear heritability of major depressive disorder (MDD), though its underlying biology remains unclear. The majority of single-nucleotide polymorphism (SNP) linkage blocks associated with MDD by genome-wide association studies (GWASes) are believed to alter transcriptional regulators (e.g., enhancers, promoters), based on enrichment of marks correlated with these functions. A key to understanding MDD pathophysiology will be elucidation of which SNPs are functional and how such functional variants biologically converge to elicit the disease. Furthermore, retinoids can elicit MDD in patients and promote depressive behaviors in rodent models, acting via a regulatory system of retinoid receptor transcription factors (TFs). We therefore sought to simultaneously identify functional genetic variants and assess retinoid pathway regulation of MDD risk loci. Using Massively Parallel Reporter Assays (MPRAs), we functionally screened over 1000 SNPs prioritized from 39 neuropsychiatric trait/disease GWAS loci, with SNPs selected based on overlap with predicted regulatory features, including expression quantitative trait loci (eQTL) and histone marks, from human brains and cell cultures. We identified >100 SNPs with allelic effects on expression in a retinoid-responsive model system. Further, functional SNPs were enriched for binding sequences of retinoic acid-receptive transcription factors (TFs), with additional allelic differences unmasked by treatment with all-trans retinoic acid (ATRA). Finally, motifs overrepresented across functional SNPs corresponded to TFs highly specific to serotonergic neurons, suggesting an in vivo site of action. Our application of MPRAs to screen MDD-associated SNPs suggests a shared transcriptional regulatory program across loci, a subset of which are unmasked by retinoids.

