Advantages of genotype imputation with ethnically matched reference panel for rare variant association analyses

Abstract

Genotype imputation has become a standard procedure prior to genome-wide association studies (GWASs). For common and low-frequency variants, genotype imputation can be performed sufficiently accurately with publicly available and ethnically heterogeneous reference datasets such as the 1000 Genomes Project (1000G) and Haplotype Reference Consortium panels. However, the imputation of rare variants has been shown to be significantly more accurate when an ethnically matched reference panel is used. Moreover, greater genetic similarity between the reference panel and target samples facilitates the detection of rare (or even population-specific) causal variants. Nevertheless, the genome-wide downstream consequences and differences of using ethnically mixed and matched reference panels have not yet been comprehensively explored.

We determined and quantified these differences by performing several comparative evaluations of discovery-driven analysis scenarios. A variant-wise GWAS was performed on seven complex diseases and body mass index using genome-wide genotype data of ∼37,000 Estonians imputed with the ethnically mixed 1000G and ethnically matched imputation reference panels. Several previously reported common (minor allele frequency, MAF > 5%) variant associations were replicated in both resulting imputed datasets, and no major differences were observed among the genome-wide significant findings or in the fine-mapping effort. In the analysis of rare (MAF < 1%) coding variants, 46 significantly associated genes were identified in the ethnically matched imputed data, compared with four genes in the 1000G panel-based imputed data. All resulting genes were consequently studied in the UK Biobank data.

These associations provide a solid example of how rare variants can be efficiently analysed to discover novel, potentially functional genetic variants in relevant phenotypes. Furthermore, our work serves as proof of a cost-efficient study design, demonstrating that the use of ethnically matched imputation reference panels can enable substantially improved imputation of rare variants, facilitating novel high-confidence findings in rare variant GWAS scans.

Author summary

Over the last decade, genome-wide association studies (GWASs) have been widely used for detecting genetic biomarkers in a wide range of traits. Typically, GWASs are carried out using chip-based genotyping data, which are then combined with a more densely genotyped reference panel to infer untyped genetic variants in chip-typed individuals. This method is called genotype imputation, and its accuracy depends on multiple factors. Publicly available and ethnically heterogeneous imputation reference panels (IRPs) such as the 1000 Genomes Project (1000G) are sufficiently accurate for imputation of common and low-frequency variants, but custom ethnically matched IRPs outperform them for rare variants. In this work, we systematically compare downstream association analysis effects on eight complex traits in ∼37,000 Estonians imputed with ethnically mixed and ethnically matched IRPs. We do not observe major differences in the single-variant analysis, where both imputed datasets replicate previously reported significant loci. In the gene-based analysis of rare protein-coding variants, however, we show that the ethnically matched panel clearly outperforms 1000G panel-based imputation, providing a 10-fold increase in significant gene-trait associations. Our study demonstrates empirically that imputed data based on an ethnically matched panel is very promising for rare variant analysis: it captures more population-specific variants and makes it possible to efficiently identify novel findings.


Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Nature ◽ 10.1038/s41586-021-03205-y ◽ 2021 ◽ Vol 590 (7845) ◽ pp. 290-299 ◽ Cited By ~ 22

Author(s): Daniel Taliun ◽ Daniel N. Harris ◽ Michael D. Kessler ◽ Jedidiah Carlson ◽ ...

Keyword(s): Rare Variants ◽ Sequence Data ◽ Association Studies ◽ Genotype Imputation ◽ Genome Wide Association Studies ◽ Phenotypic Data ◽ Treatment And Prevention ◽ Genome Wide ◽ Diverse Backgrounds ◽ Unmapped Reads

Abstract

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.


Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via 412,394 UK Biobank exome sequences

10.1101/2021.12.24.21268381 ◽ 2021

Author(s): Abhishek Nag ◽ Lawrence Middleton ◽ Ryan S Dhindsa ◽ Dimitrios Vitsios ◽ Eleanor M Wigmore ◽ ...

Keyword(s): Gene Networks ◽ Rare Variants ◽ Association Studies ◽ Low Frequency ◽ Genome Wide Association Studies ◽ Uk Biobank ◽ Protein Coding ◽ The Uk ◽ Metabolic Biomarkers ◽ Coding Variants

Genome-wide association studies have established the contribution of common and low-frequency variants to metabolic biomarkers in the UK Biobank (UKB); however, the role of rare variants remains to be assessed systematically. We evaluated rare coding variants for 198 metabolic biomarkers, including metabolites assayed by Nightingale Health, using exome sequencing in participants from four genetically diverse ancestries in the UKB (N = 412,394). Gene-level collapsing analysis, which evaluated a range of genetic architectures, identified a total of 1,303 significant relationships between genes and metabolic biomarkers (p < 1×10⁻⁸), encompassing 207 distinct genes. These include associations between rare non-synonymous variants in GIGYF1 and glucose and lipid biomarkers, SYT7 and creatinine, and others, which may provide insights into novel disease biology. Compared with a previous microarray-based genotyping study in the same cohort, 40% of the gene-biomarker relationships identified in the collapsing analysis were novel. Finally, we applied Gene-SCOUT, a novel tool that utilises the gene-biomarker association statistics from the collapsing analysis to identify genes having similar biomarker fingerprints and thus expand our understanding of gene networks.
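The idea behind a gene-level collapsing analysis can be illustrated with a minimal sketch. It is not the study's pipeline: the abstract does not specify the collapsing models, so the carrier definition, the Welch t statistic, and all names (`collapse_carriers`, `welch_t`, the toy sample and variant IDs) are assumptions made for illustration. Each sample is reduced to a single indicator (carries at least one qualifying rare variant in the gene, or not), and the biomarker is then compared between carriers and non-carriers.

```python
from math import sqrt
from statistics import mean, variance


def collapse_carriers(genotypes, qualifying_variants):
    """Collapse per-variant genotypes into one carrier flag per sample.

    genotypes: dict mapping sample ID -> dict of variant ID -> alt-allele count.
    qualifying_variants: set of variant IDs counted for this gene (hypothetical
    filter, e.g. rare and non-synonymous).
    """
    return {
        sample: any(calls.get(v, 0) > 0 for v in qualifying_variants)
        for sample, calls in genotypes.items()
    }


def welch_t(carrier_flags, biomarker):
    """Welch t statistic for carriers versus non-carriers.

    Requires at least two samples in each group (sample variance is used).
    """
    carriers = [biomarker[s] for s, flag in carrier_flags.items() if flag]
    others = [biomarker[s] for s, flag in carrier_flags.items() if not flag]
    se = sqrt(variance(carriers) / len(carriers) + variance(others) / len(others))
    return (mean(carriers) - mean(others)) / se


# Toy data: v3 is present in s4 but is not a qualifying variant.
toy_genotypes = {
    "s1": {"v1": 1},
    "s2": {"v2": 1},
    "s3": {},
    "s4": {"v3": 2},
    "s5": {},
}
toy_biomarker = {"s1": 5.0, "s2": 7.0, "s3": 1.0, "s4": 2.0, "s5": 3.0}

flags = collapse_carriers(toy_genotypes, {"v1", "v2"})
t_stat = welch_t(flags, toy_biomarker)
```

Collapsing trades per-variant resolution for power: individually rare variants are too sparse to test one by one, but their carriers can be pooled at the gene level.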


Common and Rare Variants in Genes Associated with von Willebrand Factor Level Variation: No Accumulation of Rare Variants in Swedish von Willebrand Disease Patients

TH Open ◽ 10.1055/s-0040-1718885 ◽ 2020 ◽ Vol 04 (04) ◽ pp. e322-e331

Author(s): Eric Manderstedt ◽ Christina Lind-Halldén ◽ Stefan Lethagen ◽ Christer Halldén

Keyword(s): Von Willebrand Factor ◽ Rare Variants ◽ Association Studies ◽ Low Frequency ◽ Von Willebrand Disease ◽ Genome Wide Association Studies ◽ Genome Wide ◽ The Common ◽ Von Willebrand ◽ Willebrand Factor

Abstract

Genome-wide association studies (GWASs) have identified genes that affect plasma von Willebrand factor (VWF) levels. ABO showed a strong effect, whereas smaller effects were seen for VWF, STXBP5, STAB2, SCARA5, STX2, TC2N, and CLEC4M. This study screened comprehensively for both common and rare variants in these eight genes by resequencing their coding sequences in 104 Swedish von Willebrand disease (VWD) patients. The common variants previously associated with the VWF level were all accumulated in the VWD patients compared to three control populations. The strongest effect was detected for blood group O coded for by the ABO gene (71 vs. 38% of genotypes). The other seven VWF-level-associated alleles were enriched in the VWD population compared to control populations, but the differences were small and not significant. The sequencing detected a total of 146 variants in the eight genes. Excluding 70 variants in VWF, 76 variants remained. Of the 76 variants, 54 had allele frequencies > 0.5% and have therefore been investigated for their association with the VWF level in previous GWAS. The remaining 22 variants with frequencies < 0.5% are less likely to have been evaluated previously. PolyPhen2 classified 3 out of the 22 variants as probably or possibly damaging (two in STAB2 and one in STX2); the others were either synonymous or benign. No accumulation of low-frequency (0.05–0.5%) or rare variants (<0.05%) in the VWD population compared to the gnomAD (Genome Aggregation Database) population was detected. Thus, rare variants in these genes do not contribute to the low VWF levels observed in VWD patients.


Sequencing of over 100,000 individuals identifies multiple genes and rare variants associated with Crohns disease susceptibility

10.1101/2021.06.15.21258641 ◽ 2021

Author(s): Aleksejs Sazonovs ◽ Christine R Stevens ◽ Guhan R Venkataraman ◽ Kai Yuan ◽ Brandon Avila ◽ ...

Keyword(s): Rare Variants ◽ Disease Risk ◽ Sequence Data ◽ Association Studies ◽ Genome Wide Association Studies ◽ Crohns Disease ◽ Biological Targets ◽ Genome Wide ◽ Coding Variants ◽ First Time

Genome-wide association studies (GWAS) have identified hundreds of loci associated with Crohn's disease (CD); however, as with all complex diseases, deriving pathogenic mechanisms from these non-coding GWAS discoveries has been challenging. To complement GWAS and better define actionable biological targets, we analysed sequence data from more than 30,000 CD cases and 80,000 population controls. We observe rare coding variants in established CD susceptibility genes as well as ten genes where coding variation directly implicates the gene in disease risk for the first time.


Deep genotype imputation captures virtually all heritability of autoimmune vitiligo

Human Molecular Genetics ◽ 10.1093/hmg/ddaa005 ◽ 2020 ◽ Vol 29 (5) ◽ pp. 859-863 ◽ Cited By ~ 3

Author(s): Genevieve H L Roberts ◽ Stephanie A Santorico ◽ Richard A Spritz

Keyword(s): Complex Disease ◽ Rare Variants ◽ Association Studies ◽ Genotype Imputation ◽ Genome Wide Association Studies ◽ Common Variants ◽ Genome Wide ◽ Autoimmune Vitiligo ◽ Family Based ◽ Project Data

Abstract

Autoimmune vitiligo is a complex disease involving polygenic risk from at least 50 loci previously identified by genome-wide association studies. The objectives of this study were to estimate and compare vitiligo heritability in European-derived patients using both family-based and ‘deep imputation’ genotype-based approaches. We estimated family-based heritability (h²FAM) by vitiligo recurrence among a total of 8034 first-degree relatives (3776 siblings, 4258 parents or offspring) of 2122 unrelated vitiligo probands. We estimated genotype-based heritability (h²SNP) by deep imputation to Haplotype Reference Consortium and 1000 Genomes Project data in 2812 unrelated vitiligo cases and 37 079 controls genotyped genome wide, achieving high-quality imputation from markers with minor allele frequency (MAF) as low as 0.0001. Heritability estimated by both approaches was exceedingly high: h²FAM = 0.75–0.83 and h²SNP = 0.78. These estimates are statistically identical, indicating there is essentially no remaining ‘missing heritability’ for vitiligo. Overall, ~70% of h²SNP is represented by common variants (MAF > 0.01) and 30% by rare variants. These results demonstrate that essentially all vitiligo heritable risk is captured by array-based genotyping and deep imputation. These findings suggest that vitiligo may provide a particularly tractable model for investigation of complex disease genetic architecture and predictive aspects of personalized medicine.
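As a rough worked illustration of the reported split, the approximate shares can be multiplied through the genotype-based estimate. The subscripts "common" and "rare" are labels introduced here, and the results are only as precise as the rounded percentages in the abstract.

```latex
\begin{align*}
h^2_{\mathrm{SNP,\,common}} &\approx 0.70 \times 0.78 \approx 0.55 \\
h^2_{\mathrm{SNP,\,rare}}   &\approx 0.30 \times 0.78 \approx 0.23
\end{align*}
```

On this reading, rare variants (MAF ≤ 0.01) would account for roughly 0.23 of liability-scale variance, a contribution that only becomes measurable when imputation reaches very low allele frequencies.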


Genotype Imputation Performance of Three Reference Panels Using African Ancestry Individuals

10.1101/245035 ◽ 2018

Author(s): Candelaria Vergara ◽ Margaret M. Parker ◽ Liliana Franco ◽ Michael H. Cho ◽ Ana V. Valencia-Duarte ◽ ...

Keyword(s): High Performance ◽ Association Studies ◽ African Ancestry ◽ Imputation Accuracy ◽ Low Frequency ◽ Genotype Imputation ◽ European Ancestry ◽ Genome Wide Association Studies ◽ Multiple Sources ◽ Genome Wide

Abstract

Genotype imputation is used to estimate unobserved genotypes from genome-wide marker data, to increase genome coverage and power for genome-wide association studies. Imputation has been most successful for European ancestry populations, for which very large reference panels are available. Smaller subsets of African descent populations are available in 1000 Genomes (1000G), the Consortium on Asthma among African-Ancestry Populations in the Americas (CAAPA) and the Haplotype Reference Consortium (HRC). We aimed to compare the performance of these reference panels when imputing variation in 3,747 African Americans (AA) from 2 cohorts (HCV and COPDGene) genotyped using the Illumina Omni family of microarrays. The haplotypes of 2,504 individuals (from 1000G), 883 (from CAAPA) and 32,611 (from HRC) were used as reference. We compared the performance of these panels based on number of variants, imputation quality, imputation accuracy and coverage. In both cohorts, 1000G imputed 1.5–1.6x more variants than CAAPA and 1.2x more variants than HRC. Similar findings were observed for variants with higher imputation quality (R² > 0.5) and for rare, low-frequency, and common variants. When merging the results of the three panels, the total number of imputed variants was 62M–63M, with 20M overlapping variants imputed by all three panels and a range of 5M to 15M unique variants imputed exclusively with one of the three panels. For overlapping variants, imputation quality was highest for HRC, followed by 1000G, then CAAPA, and improved as the minor allele frequency increased. The 1000G, HRC and CAAPA participants of African ancestry provided high performance and accuracy for imputation of African American admixed individuals, increasing the total number of variants with high quality available for subsequent analyses. These three panels are complementary and would benefit from the development of an integrated African reference panel, including data from multiple sources and populations.
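One way to make "imputation accuracy" concrete is the squared correlation between imputed dosages and known genotypes at a variant. The sketch below is illustrative only: the abstract does not say how its R² was computed (imputation software typically reports an estimated R² without needing true genotypes), and the function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as alt-allele counts (0, 1 or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns NaN when either vector has no variation, since the correlation is
    undefined for a monomorphic site.
    """
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return float("nan")
    return cov * cov / (var_t * var_d)


# Toy variant: four samples with known genotypes and imputed dosages.
truth = [0, 1, 2, 0]
perfect = dosage_r2(truth, [0.0, 1.0, 2.0, 0.0])
shrunken = dosage_r2(truth, [0.0, 0.5, 1.0, 0.0])  # dosages shrunk by half
maf = minor_allele_frequency(truth)
```

Because r² is invariant to rescaling, uniformly shrunken dosages still score as perfectly correlated, which is one reason accuracy is usually reported alongside other metrics and stratified by MAF.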


A one penny imputed genome from next generation reference panels

10.1101/357806 ◽ 2018 ◽ Cited By ~ 1

Author(s): Brian L. Browning ◽ Ying Zhou ◽ Sharon R. Browning

Keyword(s): Association Studies ◽ Computational Cost ◽ Computation Time ◽ Genotype Imputation ◽ Reference Panel ◽ Genome Wide Association Studies ◽ Panel Size ◽ Genome Wide ◽ New Genotype ◽ Reference Samples

Abstract

Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1000 phased target samples, Beagle 5.0’s computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1000 phased target samples at a cost of less than one US cent per sample.

Beagle 5.0 is freely available from https://faculty.washington.edu/browning/beagle/beagle.html.


Sequencing and Imputation in GWAS: Cost-Effective Strategies to Increase Power and Genomic Coverage Across Diverse Populations

10.1101/548321 ◽ 2019 ◽ Cited By ~ 2

Author(s): Corbin Quick ◽ Pramod Anugu ◽ Solomon Musani ◽ Scott T. Weiss ◽ Esteban G. Burchard ◽ ...

Keyword(s): Genetic Variation ◽ Statistical Power ◽ Rare Variants ◽ Association Studies ◽ Cost Effective ◽ Genotype Imputation ◽ Reference Panel ◽ Genome Wide Association Studies ◽ Full Spectrum ◽ Study Participants

Abstract

A key aim for current genome-wide association studies (GWAS) is to interrogate the full spectrum of genetic variation underlying human traits, including rare variants, across populations. Deep whole-genome sequencing is the gold standard to capture the full spectrum of genetic variation, but remains prohibitively expensive for large samples. Array genotyping interrogates a sparser set of variants, which can be used as a scaffold for genotype imputation to capture variation across a wider set of variants. However, imputation coverage and accuracy depend crucially on the reference panel size and genetic distance from the target population.

Here, we consider a strategy in which a subset of study participants is sequenced and the rest array-genotyped and imputed using a reference panel that comprises the sequenced study participants and individuals from an external reference panel. We systematically assess how imputation quality and statistical power for association depend on the number of individuals sequenced and included in the reference panel for two admixed populations (African and Latino Americans) and two European population isolates (Sardinians and Finns). We develop a framework to identify powerful and cost-effective GWAS designs in these populations given current sequencing and array genotyping costs. For populations that are well-represented in current reference panels, we find that array genotyping alone is cost-effective and well-powered to detect both common- and rare-variant associations. For poorly represented populations, we find that sequencing a subset of study participants to improve imputation is often more cost-effective than array genotyping alone, and can substantially increase genomic coverage and power.


Evaluation and application of summary statistic imputation to discover new height-associated loci

10.1101/204560 ◽ 2017

Author(s): Sina Rüeger ◽ Aaron McDaid ◽ Zoltán Kutalik

Keyword(s): Genetic Variants ◽ Association Studies ◽ Low Frequency ◽ Cost Effective ◽ Genotype Imputation ◽ Genome Wide Association Studies ◽ Summary Statistics ◽ Uk Biobank ◽ Genome Wide ◽ The Uk

Abstract

As most of the heritability of complex traits is attributed to common and low-frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that, while genotype imputation boasts a 2- to 5-fold lower root-mean-square error, summary statistics imputation better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01 and 0.05, using summary statistics imputation yielded an increase in statistical power of 15, 10 and 3%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.

Author summary

Genome-wide association studies (GWASs) quantify the effect of genetic variants on traits, such as height. Such estimates are called association summary statistics and are typically publicly shared through publication. Typically, GWASs are carried out by genotyping ~500,000 SNVs for each individual, which are then combined with sequenced reference panels to infer untyped SNVs in each individual's genome. This process of genotype imputation is resource intensive and can therefore be a limitation when combining many GWASs. An alternative approach is to bypass the use of individual data and directly impute summary statistics. In our work we compare the performance of summary statistics imputation to genotype imputation. Although we observe a 2- to 5-fold lower RMSE for genotype imputation compared to summary statistics imputation, summary statistics imputation better distinguishes true associations from null results. Furthermore, we demonstrate the potential of summary statistics imputation by presenting 34 novel height-associated loci, 19 of which were confirmed in UK Biobank. Our study demonstrates that given current reference panels, summary statistics imputation is a very efficient and cost-effective way to identify common or low-frequency trait-associated loci.
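Summary statistics imputation is commonly formulated as a conditional expectation under a multivariate normal model: the Z-score of an untyped variant is predicted from typed-variant Z-scores through linkage disequilibrium (LD) correlations estimated in a reference panel. The sketch below assumes that textbook form, with an optional ridge term for numerical stability. It is not necessarily the exact estimator used in the study, and the function names, `ridge` parameter and toy LD values are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is singular; consider a ridge term")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def impute_z(ld_typed, ld_untyped_typed, z_typed, ridge=0.0):
    """Impute one untyped Z-score as c^T (C + ridge * I)^(-1) z.

    ld_typed: LD correlation matrix C among typed variants (list of lists).
    ld_untyped_typed: correlations c between the untyped and each typed variant.
    z_typed: observed association Z-scores z at the typed variants.
    """
    n = len(z_typed)
    regularised = [
        [ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    weights = solve_linear(regularised, z_typed)
    return sum(c * w for c, w in zip(ld_untyped_typed, weights))


# Toy example: the untyped variant is in perfect LD with the second typed one,
# so its imputed Z-score should equal that variant's Z-score.
ld = [[1.0, 0.5], [0.5, 1.0]]
z_imputed = impute_z(ld, [0.5, 1.0], [3.0, -2.0])
```

Only the reference-panel LD matrix and published Z-scores enter the calculation, which is why the approach avoids re-imputing individual-level genotypes when reference panels are updated.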


Low frequency and rare coding variation contributes to multiple sclerosis risk

10.1101/286617 ◽ 2018

Author(s): Mitja Mitrovic ◽ Nikolaos Patsopoulos ◽ Ashley Beecham ◽ Theresa Dankowski ◽ ...

Keyword(s): Multiple Sclerosis ◽ Association Studies ◽ Low Frequency ◽ Frequency Variation ◽ Genome Wide Association Studies ◽ T Cell Homeostasis ◽ Gene Coding ◽ Genome Wide ◽ Common Genetic Variants ◽ Coding Variants

Abstract

Multiple sclerosis is a common, complex neurological disease, where almost 20% of risk heritability can be attributed to common genetic variants, including >230 identified by genome-wide association studies (Patsopoulos et al., 2017). Multiple strands of evidence suggest that the majority of the remaining heritability is also due to the additive effects of individual variants, rather than epistatic interactions between these variants, or mutations exclusive to individual families. Here, we show in 68,379 cases and controls that as much as 5% of this heritability is explained by low-frequency variation in gene coding sequence. We identify four novel genes driving MS risk independently of common variant signals, which highlight a key role for regulatory T cell homeostasis and regulation, IFNγ biology and NFκB signaling in MS pathogenesis. As low-frequency variants do not show substantial linkage disequilibrium with other variants, and as coding variants are more interpretable and experimentally tractable than non-coding variation, our discoveries constitute a rich resource for dissecting the pathobiology of MS.
