Impact of variant-level batch effects on identification of genetic risk factors in large sequencing studies

PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249305
Author(s):  
Daniel P. Wickland ◽  
Yingxue Ren ◽  
Jason P. Sinnwell ◽  
Joseph S. Reddy ◽  
Cyril Pottier ◽  
...  

Genetic studies have shifted to sequencing-based rare variants discovery after decades of success in identifying common disease variants by Genome-Wide Association Studies using Single Nucleotide Polymorphism chips. Sequencing-based studies require large sample sizes for statistical power and therefore often inadvertently introduce batch effects because samples are typically collected, processed, and sequenced at multiple centers. Conventionally, batch effects are first detected and visualized using Principal Components Analysis and then controlled by including batch covariates in the disease association models. For sequencing-based genetic studies, because all variants included in the association analyses have passed sequencing-related quality control measures, this conventional approach treats every variant as equal and ignores the substantial differences still remaining in variant qualities and characteristics such as genotype quality scores, alternative allele fractions (fraction of reads supporting alternative allele at a variant position) and sequencing depths. In the Alzheimer’s Disease Sequencing Project (ADSP) exome dataset of 9,904 cases and controls, we discovered hidden variant-level differences between sample batches of three sequencing centers and two exome capture kits. Although sequencing centers were included as a covariate in our association models, we observed differences at the variant level in genotype quality and alternative allele fraction between samples processed by different exome capture kits that significantly impacted both the confidence of variant detection and the identification of disease-associated variants. Furthermore, we found that a subset of top disease-risk variants came exclusively from samples processed by one exome capture kit that was more effective at capturing the alternative alleles compared to the other kit. 
Our findings highlight the importance of additional variant-level quality control for large sequencing-based genetic studies. More importantly, we demonstrate that automatically filtering out variants with batch differences may lead to false negatives if the batch discordances come largely from quality differences and if the batch-specific variants have better quality.
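
The alternative allele fraction defined above (reads supporting the alternative allele divided by all reads at the position) lends itself to a simple per-batch comparison. The following is a minimal sketch, not the authors' pipeline: the input format, the function names, the batch labels `kit_A` and `kit_B`, and the read counts are all made up for illustration.

```python
from statistics import mean


def alt_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the alternative allele at one position."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


def mean_aaf_by_batch(calls):
    """Average alternative allele fraction per batch for a single variant.

    `calls` is an iterable of (batch, ref_reads, alt_reads) tuples, one per
    sample carrying the variant. This is a toy input format (an assumption),
    not the format of the ADSP data.
    """
    by_batch = {}
    for batch, ref_reads, alt_reads in calls:
        by_batch.setdefault(batch, []).append(
            alt_allele_fraction(ref_reads, alt_reads)
        )
    return {batch: mean(values) for batch, values in by_batch.items()}


# Made-up read counts for one variant seen in two hypothetical capture kits.
toy_calls = [
    ("kit_A", 10, 10),
    ("kit_A", 12, 8),
    ("kit_B", 15, 5),
    ("kit_B", 18, 2),
]
aaf = mean_aaf_by_batch(toy_calls)
print(aaf)
```

A large gap between batches in a summary like this would be a reason to inspect the variant more closely, in line with the abstract's caution that such a gap alone does not justify filtering the variant out.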

2020 ◽  
Author(s):  
Daniel P Wickland ◽  
Yingxue Ren ◽  
Jason P Sinnwell ◽  
Joseph S Reddy ◽  
Cyril Pottier ◽  
...  

Abstract Background: Genetic studies have shifted to sequencing-based rare variants discovery after decades of success in identifying common disease variants by Genome-Wide Association Studies using Single Nucleotide Polymorphism chips. Sequencing-based studies require large sample sizes for statistical power but often inadvertently introduce batch effects because samples are typically collected, processed, and sequenced at multiple centers. Conventionally, batch effects are first detected and visualized using Principal Components Analysis and then controlled by including batch covariates in the disease association models. For sequencing-based genetic studies, because all variants included in the association analyses have passed quality control measures, this conventional approach treats every variant as equal and ignores the substantial differences still remaining in variant qualities and characteristics such as genotype quality scores, alternative allele fractions (fraction of reads supporting alternative allele at a variant position) and sequencing depths. Results: In the Alzheimer’s Disease Sequencing Project (ADSP) exome dataset of 9,904 cases and controls, we discovered hidden variant-level differences between sample batches of three sequencing centers and two exome capture kits. Although sequencing centers were included as a covariate in our association models, we observed differences at the variant level in genotype quality and alternative allele fraction between samples processed by different exome capture kits that significantly impacted both the confidence of variant detection and the identification of disease-associated variants. Furthermore, we found that the association signals of a subset of top disease risk variants came exclusively from samples processed by one exome capture kit that was more effective at capturing the alternative alleles compared to the other kit. 
Conclusions: Our findings highlight the importance of additional variant-level quality control for large sequencing-based genetic studies. More importantly, we demonstrate that automatically filtering out variants with batch differences may lead to false negatives if the batch discordance came largely from quality differences and if the variants from one batch had better quality scores.


2021 ◽  
Author(s):  
Steven Gazal ◽  
Omer Weissbrod ◽  
Farhad Hormozdiari ◽  
Kushal Dey ◽  
Joseph Nasser ◽  
...  

Although genome-wide association studies (GWAS) have identified thousands of disease-associated common SNPs, these SNPs generally do not implicate the underlying target genes, as most disease SNPs are regulatory. Many SNP-to-gene (S2G) linking strategies have been developed to link regulatory SNPs to the genes that they regulate in cis, but it is unclear how these strategies should be applied in the context of interpreting common disease risk variants. We developed a framework for evaluating and combining different S2G strategies to optimize their informativeness for common disease risk, leveraging polygenic analyses of disease heritability to define and estimate their precision and recall. We applied our framework to GWAS summary statistics for 63 diseases and complex traits (average N=314K), evaluating 50 S2G strategies. Our optimal combined S2G strategy (cS2G) included 7 constituent S2G strategies (Exon, Promoter, 2 fine-mapped cis-eQTL strategies, EpiMap enhancer-gene linking, Activity-By-Contact (ABC), and Cicero), and achieved a precision of 0.75 and a recall of 0.33, more than doubling the precision and/or recall of any individual strategy; this implies that 33% of SNP-heritability can be linked to causal genes with 75% confidence. We applied cS2G to fine-mapping results for 49 UK Biobank diseases/traits to predict 7,111 causal SNP-gene-disease triplets (with S2G-derived functional interpretation) with high confidence. Finally, we applied cS2G to genome-wide fine-mapping results for these traits (not restricted to GWAS loci) to rank genes by the heritability linked to each gene, providing an empirical assessment of disease omnigenicity; averaging across traits, we determined that the top 200 (1%) of ranked genes explained roughly half of the heritability linked to all genes. 
Our results highlight the benefits of our cS2G strategy in providing functional interpretation of GWAS findings; we anticipate that precision and recall will increase further under our framework as improved functional assays lead to improved S2G strategies. 


2010 ◽  
Vol 34 (8) ◽  
pp. 854-862 ◽  
Author(s):  
Sang Hong Lee ◽  
Dale R. Nyholt ◽  
Stuart Macgregor ◽  
Anjali K. Henders ◽  
Krina T. Zondervan ◽  
...  

2016 ◽  
Author(s):  
Sara L. Pulit ◽  
Sera A.J. de With ◽  
Paul I.W. de Bakker

Genome-wide association studies (GWAS) of common disease have been hugely successful in implicating loci that modify disease risk. The bulk of these associations have proven robust and reproducible, in part due to community adoption of statistical criteria for claiming significant genotype-phenotype associations. Currently, studies of common disease are rapidly shifting towards the use of sequencing technologies. As the cost of sequencing drops, assembling large samples in global populations is becoming increasingly feasible. Sequencing studies interrogate not only common variants, as was true for genotyping-based GWAS, but variation across the full allele frequency spectrum, yielding many more (independent) statistical tests. We sought to empirically determine genome-wide significance for various analysis scenarios. Using whole-genome sequence data, we simulated sequencing-based disease studies of varying sample size and ancestry. We determined that future sequencing efforts in >2,000 samples should practically employ a genome-wide significance threshold of p < 5 × 10⁻⁹, though the threshold does vary with ancestry. Studies of European or East Asian ancestry should set genome-wide significance at approximately p < 5 × 10⁻⁹, but similar studies of African or South Asian samples should be more stringent (p < 1 × 10⁻⁹). Because sequencing analysis brings with it many challenges (especially for rare variants), appropriate adoption of a revised multiple test correction will be crucial to avoid irreproducible claims of association.
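
For intuition about the size of these thresholds, one can ask how many independent tests each would correspond to under a plain Bonferroni correction. This is only a back-of-the-envelope sketch: the abstract says the thresholds were determined empirically from simulations, so the Bonferroni reading and the family-wise error rate of 0.05 are assumptions made here, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def implied_independent_tests(alpha: float, threshold: float) -> float:
    """Number of independent tests at which Bonferroni yields `threshold`."""
    return alpha / threshold


ALPHA = 0.05  # assumed family-wise error rate; not stated in the abstract

# Thresholds quoted in the abstract.
tests_at_5e9 = implied_independent_tests(ALPHA, 5e-9)  # about 1e7 tests
tests_at_1e9 = implied_independent_tests(ALPHA, 1e-9)  # about 5e7 tests

# For comparison, one million tests gives the familiar 5e-8 level.
chip_like = bonferroni_threshold(ALPHA, 1e6)

print(f"{tests_at_5e9:.3g} {tests_at_1e9:.3g} {chip_like:.3g}")
```

Under these assumptions, moving from 5 × 10⁻⁹ to 1 × 10⁻⁹ corresponds to a fivefold increase in the number of effectively independent tests.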


Author(s):  
Joseph Nasser ◽  
Drew T. Bergman ◽  
Charles P. Fulco ◽  
Philine Guckelberger ◽  
Benjamin R. Doughty ◽  
...  

Genome-wide association studies have now identified tens of thousands of noncoding loci associated with human diseases and complex traits, each of which could reveal insights into biological mechanisms of disease. Many of the underlying causal variants are thought to affect enhancers, but we have lacked genome-wide maps of enhancer-gene regulation to interpret such variants. We previously developed the Activity-by-Contact (ABC) Model to predict enhancer-gene connections and demonstrated that it can accurately predict the results of CRISPR perturbations across several cell types. Here, we apply this ABC Model to create enhancer-gene maps in 131 cell types and tissues, and use these maps to interpret the functions of fine-mapped GWAS variants. For inflammatory bowel disease (IBD), causal variants are >20-fold enriched in enhancers in particular cell types, and ABC outperforms other regulatory methods at connecting noncoding variants to target genes. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes via variants in enhancers that act in different cell types. Guided by these variant-to-function maps, we show that an enhancer containing an IBD risk variant regulates the expression of PPIF to tune mitochondrial membrane potential. Together, our study reveals insights into principles of genome regulation, illuminates mechanisms that influence IBD, and demonstrates a generalizable strategy to connect common disease risk variants to their molecular and cellular functions.


2016 ◽  
Vol 119 (suppl_1) ◽  
Author(s):  
Aditya Kumar ◽  
Stephanie Thomas ◽  
Kirsten Wong ◽  
Kevin Tenerelli ◽  
Valentina Lo Sardo ◽  
...  

Genome-wide association studies have identified single nucleotide polymorphisms (SNPs) at gene loci that affect cardiovascular function, and while mechanisms in protein-coding loci are obvious, those in non-coding loci are difficult to determine. 9p21 is a recently identified locus associated with increased risk of coronary artery disease (CAD) and myocardial infarction. Associations have implicated SNPs in altering smooth muscle and endothelial cell properties but have not identified adverse effects in cardiomyocytes (CMs) despite enhanced disease risk. Using induced pluripotent stem cell-derived CMs from patients who are homozygous risk/risk (R/R) and non-risk/non-risk (N/N) for 9p21 SNPs and either CAD positive or negative, we assessed CM function when cultured on hydrogels capable of mimicking the fibrotic stiffening associated with disease post-heart attack, i.e. “heart attack-in-a-dish” stiffening from 11 kilopascals (kPa) to 50 kPa. While all CMs independent of genotype and disease beat synchronously on soft matrices, R/R CMs cultured on dynamically stiffened hydrogels exhibited asynchronous contractions and had significantly lower correlation coefficients versus N/N CMs in the same conditions. Dynamic stiffening reduced connexin 43 expression and gap junction assembly in R/R CMs but not N/N CMs. To eliminate patient-to-patient variability, we created an isogenic line by deleting the 9p21 gene locus from an R/R patient using TALEN-mediated gene editing, i.e. R/R KO. Deletion of the 9p21 locus restored synchronous contractility and organized connexin 43 junctions. As a non-coding locus, 9p21 appears to repress connexin transcription, leading to the phenotypes we observe, but only when the niche is stiffened as in disease. These data are the first to demonstrate that disease-specific niche remodeling, e.g. a “heart attack-in-a-dish” model, can differentially affect CM function depending on SNPs within a non-coding locus.


2021 ◽  
Vol 12 ◽  
Author(s):  
Valeria Orrù ◽  
Maristella Steri ◽  
Francesco Cucca ◽  
Edoardo Fiorillo

In recent years, systematic genome-wide association studies of quantitative immune cell traits, represented by circulating levels of cell subtypes established by flow cytometry, have revealed numerous association signals, a large fraction of which overlap perfectly with genetic signals associated with autoimmune diseases. By identifying further overlaps with association signals influencing gene expression and cell surface protein levels, it has also been possible, in several cases, to identify causal genes and infer candidate proteins affecting immune cell traits linked to autoimmune disease risk. Overall, these results provide a more detailed picture of how genetic variation affects the human immune system and autoimmune disease risk. They also highlight druggable proteins in the pathogenesis of autoimmune diseases; predict the efficacy and side effects of existing therapies; provide new indications for use for some of them; and optimize the research and development of new, more effective and safer treatments for autoimmune diseases. Here we review the genetic-driven approach that couples systematic multi-parametric flow cytometry with high-resolution genetics and transcriptomics to identify endophenotypes of autoimmune diseases for the development of new therapies.


2019 ◽  
Author(s):  
Jing Yang ◽  
Amanda McGovern ◽  
Paul Martin ◽  
Kate Duffus ◽  
Xiangyu Ge ◽  
...  

Genome-wide association studies have identified genetic variation contributing to complex disease risk. However, assigning causal genes and mechanisms has been more challenging because disease-associated variants are often found in distal regulatory regions with cell-type specific behaviours. Here, we collect ATAC-seq, Hi-C, Capture Hi-C and nuclear RNA-seq data in stimulated CD4+ T-cells over 24 hours, to identify functional enhancers regulating gene expression. We characterise changes in DNA interaction and activity dynamics that correlate with changes in gene expression, and find that the strongest correlations are observed within 200 kb of promoters. Using rheumatoid arthritis as an example of T-cell mediated disease, we demonstrate interactions of expression quantitative trait loci with target genes, and confirm assigned genes or show complex interactions for 20% of disease associated loci, including FOXO1, which we confirm using CRISPR/Cas9.

