Comparison of genotype imputation strategies using a combined reference panel for chicken population

The large-scale, open-access whole-exome sequencing (WES) data from ~200,000 UK Biobank (UKB) participants are accelerating a new wave of genetic association studies that aim to identify rare, functional loss-of-function (LoF) variants associated with a broad range of complex traits and diseases; however, the community lacks stringent replication of new associations. In this study, we proposed to merge the WES genotypes and the genome-wide genotyping (GWAS) genotypes of 167,000 UKB Caucasian participants into a combined reference panel, and then to impute 241,911 UKB Caucasian participants who had GWAS genotypes only. We then proposed to use the imputed data to replicate associations identified in the discovery WES sample. Using a leave-100-out imputation strategy in the reference panel, we showed that the average imputation accuracy measure r2 is modest to high at LoF variants across all minor allele frequency (MAF) intervals, including ultra-rare ones: 0.942 at MAF interval [1%, 50%], 0.807 at [0.1%, 1.0%), 0.805 at [0.01%, 0.1%), 0.664 at [0.001%, 0.01%) and 0.410 at (0, 0.001%). As applications, we studied single-variant-level and gene-level associations of LoF variants with estimated heel bone mineral density (eBMD) and 4 lipid traits: high-density-lipoprotein cholesterol (HDL-C), low-density-lipoprotein cholesterol (LDL-C), triglycerides (TG) and total cholesterol (TC). In addition to replicating dozens of previously reported genes, such as MEPE for eBMD and PCSK9 for more than one lipid trait, the results also identified 2 novel gene-level associations: PLIN1 (cumulative MAF=0.10%, discovery BETA=0.38, P=1.20×10^-13; replication BETA=0.25, P=1.03×10^-6) and ANGPTL3 (cumulative MAF=0.10%, discovery BETA=−0.36, P=4.70×10^-11; replication BETA=−0.30, P=6.60×10^-11) for HDL-C, as well as one novel single-variant-level association (11:14843853:C:T, MAF=0.11%, discovery BETA=−0.31, P=2.70×10^-9; replication BETA=−0.31, P=8.80×10^-14, PDE3B) for TG. Our results highlight the strength of WES-based genotype imputation and provide useful imputed data within the UKB cohort.
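The per-interval accuracy figures quoted above can be illustrated with a minimal sketch. It assumes that imputation r2 is the squared Pearson correlation between true genotypes and imputed dosages for each masked variant, averaged within the MAF intervals listed in the abstract. The function names and input layout are hypothetical and not taken from the study.

```python
from bisect import bisect_right
from statistics import mean

# Lower edges of the MAF intervals quoted in the abstract, as fractions
# (1e-5 corresponds to 0.001%).
BIN_EDGES = [0.0, 1e-5, 1e-4, 1e-3, 1e-2]
BIN_LABELS = ["(0, 0.001%)", "[0.001%, 0.01%)", "[0.01%, 0.1%)",
              "[0.1%, 1%)", "[1%, 50%]"]


def maf_bin(maf):
    """Return the label of the MAF interval containing `maf`."""
    if not 0.0 < maf <= 0.5:
        raise ValueError("MAF must lie in (0, 0.5]")
    return BIN_LABELS[bisect_right(BIN_EDGES, maf) - 1]


def pearson_r2(truth, dosage):
    """Squared Pearson correlation; None if either vector is constant."""
    mt, md = mean(truth), mean(dosage)
    stt = sum((t - mt) ** 2 for t in truth)
    sdd = sum((d - md) ** 2 for d in dosage)
    if stt == 0 or sdd == 0:
        return None
    std = sum((t - mt) * (d - md) for t, d in zip(truth, dosage))
    return std * std / (stt * sdd)


def mean_r2_by_bin(variants):
    """variants: iterable of (maf, true_genotypes, imputed_dosages)."""
    per_bin = {}
    for maf, truth, dosage in variants:
        r2 = pearson_r2(truth, dosage)
        if r2 is not None:  # skip variants where r2 is undefined
            per_bin.setdefault(maf_bin(maf), []).append(r2)
    return {label: mean(values) for label, values in per_bin.items()}
```

In a leave-k-out design, the true genotypes would come from the samples temporarily removed from the reference panel and the dosages from imputing those same samples. The MAF is passed in explicitly because it would normally be estimated from the full panel rather than from the held-out samples.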


Genotype imputation using the Positional Burrows Wheeler Transform

PLoS Genetics ◽

10.1371/journal.pgen.1009049 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1009049

Author(s):

Simone Rubinacci ◽

Olivier Delaneau ◽

Jonathan Marchini

Keyword(s):

Computational Cost ◽

Computation Time ◽

Imputation Method ◽

Genotype Imputation ◽

Reference Panel ◽

Imputation Methods ◽

Panel Size ◽

Burrows Wheeler Transform ◽

Made In ◽

Memory Efficient

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
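To make the PBWT idea concrete, here is a minimal sketch of the positional prefix ordering at the core of that data structure. It is not IMPUTE5's implementation. It assumes binary haplotypes stored as lists of 0/1 and, as a simplification, treats the target as one more row of the panel. The function names are hypothetical.

```python
def pbwt_prefix_order(haps, k):
    """Order haplotype indices by their reversed prefixes over sites [0, k).

    Each step is a stable partition on the allele at the current site,
    so haplotypes that share a long match ending at site k - 1 end up
    adjacent in the returned order.
    """
    order = list(range(len(haps)))
    for site in range(k):
        zeros = [i for i in order if haps[i][site] == 0]
        ones = [i for i in order if haps[i][site] == 1]
        order = zeros + ones
    return order


def candidate_neighbours(haps, target, k, n_each_side=1):
    """Return haplotypes adjacent to `target` in the prefix order at site k."""
    order = pbwt_prefix_order(haps, k)
    pos = order.index(target)
    lo = max(0, pos - n_each_side)
    hi = min(len(order), pos + n_each_side + 1)
    return [i for i in order[lo:hi] if i != target]


if __name__ == "__main__":
    panel = [
        [0, 1, 0, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
        [1, 0, 1, 0],
        [0, 0, 0, 1],  # treated as the target haplotype
    ]
    print(pbwt_prefix_order(panel, 4))
    print(candidate_neighbours(panel, target=4, k=4))
```

Adjacency in the prefix order only nominates candidates. A fuller implementation would also maintain the divergence array, which records how far back each adjacent pair actually matches, and use it to keep only long matches as conditioning states.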


A reference panel of 64,976 haplotypes for genotype imputation

10.1101/035170 ◽

2015 ◽

Cited By ~ 17

Author(s):

Shane McCarthy ◽

Sayantan Das ◽

Warren Kretzschmar ◽

Olivier Delaneau ◽

Andrew R. Wood ◽

...

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Association Studies ◽

Allele Frequencies ◽

Genotype Imputation ◽

Reference Panel ◽

Whole Genome Sequence ◽

European Ancestry ◽

Whole Genome ◽

Remote Server

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1%, a large increase in the number of SNPs tested in association studies and can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.


Improved imputation of summary statistics for admixed populations

10.1101/203927 ◽

2017 ◽

Cited By ~ 3

Author(s):

Sina Rüeger ◽

Aaron McDaid ◽

Zoltán Kutalik

Keyword(s):

Meta Analysis ◽

Country Of Origin ◽

Real Data ◽

Genotype Imputation ◽

Reference Panel ◽

Summary Statistics ◽

Panel Size ◽

Noticeable Improvement

Abstract. Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis to higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, cohorts of such a meta-analysis are variable in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical. Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data, we find that accounting for ancestry admixture yields noticeable improvement only when the total reference panel size is > 1000. We show that for population-specific variants this effect is more pronounced with increasing FST.
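As background for the role the LD matrix plays, the following sketch shows the standard conditional-expectation form of summary statistics imputation: the Z-statistic of an untyped variant is estimated as c^T (C + lambda I)^-1 z, where C is the LD (correlation) matrix among typed variants and c holds the correlations between the untyped variant and the typed ones. This is the baseline method, not the variance matching extension described in the abstract. The ridge term `lam` and all names are illustrative assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is singular; try a larger lam")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def impute_z(ld_typed, ld_untyped_typed, z_typed, lam=0.0):
    """Estimate the Z-statistic of one untyped variant.

    ld_typed:          LD matrix among typed variants (list of lists)
    ld_untyped_typed:  LD between the untyped variant and each typed one
    z_typed:           observed Z-statistics at the typed variants
    lam:               ridge term added to the diagonal (assumed, optional)
    """
    n = len(z_typed)
    regularised = [[ld_typed[i][j] + (lam if i == j else 0.0)
                    for j in range(n)] for i in range(n)]
    weights = solve_linear(regularised, z_typed)
    return sum(c * w for c, w in zip(ld_untyped_typed, weights))
```

Both `ld_typed` and `ld_untyped_typed` come from an external reference panel, which is why a mismatch between the panel's ancestry and the ancestry mix of the meta-analysis cohorts can bias the imputed statistic.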


PHARP: A pig haplotype reference panel for genotype imputation

10.1101/2021.06.03.446888 ◽

2021 ◽

Author(s):

Zhen Wang ◽

Zhenyang Zhang ◽

Zitao Chen ◽

Jiabao Sun ◽

Caiyun Cao ◽

...

Keyword(s):

Complex Traits ◽

Sequence Data ◽

Genotype Imputation ◽

Reference Panel ◽

Whole Genome Sequence ◽

Sequencing Data ◽

Large White ◽

Downstream Analysis ◽

Low Coverage ◽

Analytical Tools

Pigs are not only a major meat source worldwide but are also commonly used as an animal model for studying human complex traits. A large haplotype reference panel has been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using the imputed genotypes in downstream analyses, such as GWASs, TWASs, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability for pig commercial 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate of 0.971 in 81 Large White pigs (~17x sequencing coverage). We also applied our reference panel to impute low-density SNP chip data to high density for three GWASs and found novel significantly associated SNPs that might be causal variants.


Genotype imputation and reference panel: a systematic evaluation on haplotype size and diversity

Briefings in Bioinformatics ◽

10.1093/bib/bbz108 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1806-1817 ◽

Cited By ~ 1

Author(s):

Wei-Yang Bai ◽

Xiao-Wei Zhu ◽

Pei-Kuan Cong ◽

Xue-Jun Zhang ◽

J Brent Richards ◽

...

Keyword(s):

Chinese Population ◽

Large Scale ◽

Imputation Accuracy ◽

European Population ◽

Han Chinese ◽

Population Diversity ◽

Genotype Imputation ◽

Reference Panel ◽

Systematic Evaluation ◽

Han Chinese Population

Abstract: Here, 622 imputations were conducted with 394 customized reference panels for Han Chinese and European populations. Besides validating that imputation accuracy always benefits from increased panel size when the reference panel is population specific, the results suggested two new findings. First, when the haplotype size of the reference panel was fixed, the imputation accuracy of common and low-frequency variants (minor allele frequency (MAF) > 0.5%) decreased as the population diversity of the reference panel increased, but for rare variants (MAF < 0.5%), a small fraction of diversity in the panel could improve imputation accuracy. Second, when the haplotype size of the reference panel was increased with extra population-diverse samples, the imputation accuracy of common variants (MAF > 5%) for the European population always benefited from the expanding sample size. However, for the Han Chinese population, the accuracy of all imputed variants was highest when the reference panel contained a fraction of extra diverse samples (8–21%). In addition, we evaluated imputation performance with existing reference panels, such as the Haplotype Reference Consortium (HRC), 1000 Genomes Project Phase 3 and the China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology (CONVERGE). For the European population, the HRC panel showed the best performance in our analysis. For the Han Chinese population, we proposed an optimum constituent ratio for researchers who would like to customize their own sequenced reference panel, but a high-quality and large-scale Chinese reference panel is still needed. Our findings could be generalized to other populations with conservative genomes; a tool is provided to investigate other populations of interest (https://github.com/Abyss-bai/reference-panel-reconstruction).


The impact of non-additive genetic associations on age-related complex diseases

Nature Communications ◽

10.1038/s41467-021-21952-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Marta Guindo-Martínez ◽


Ramon Amela ◽

Silvia Bonàs-Guarch ◽

Montserrat Puiggròs ◽

...

Keyword(s):

X Chromosome ◽

Additive Model ◽

Complex Diseases ◽

Additive Models ◽

Genotype Imputation ◽

Reference Panel ◽

Genome Wide Association Studies ◽

Age Related ◽

Genome Wide ◽

The Impact

Abstract: Genome-wide association studies (GWAS) are not fully comprehensive, as current strategies typically test only the additive model, exclude the X chromosome, and use only one reference panel for genotype imputation. We implement an extensive GWAS strategy, GUIDANCE, which improves genotype imputation by using multiple reference panels and includes the analysis of the X chromosome and non-additive models to test for association. We apply this methodology to 62,281 subjects across 22 age-related diseases and identify 94 genome-wide associated loci, including 26 previously unreported. Moreover, we observe that 27.7% of the 94 loci are missed if we use standard imputation strategies with a single reference panel, such as HRC, and only test the additive model. Among the new findings, we identify three novel low-frequency recessive variants with odds ratios larger than 4, which need at least a three-fold larger sample size to be detected under the additive model. This study highlights the benefits of applying innovative strategies to better uncover the genetic architecture of complex diseases.
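The difference between additive and non-additive tests comes down to how the genotype is coded before it enters the regression. The sketch below shows commonly used codings for a genotype given as the count of effect alleles (0, 1 or 2). These are textbook encodings chosen for illustration, not necessarily the exact set implemented in GUIDANCE, and the names are hypothetical.

```python
# Maps from effect-allele count (0, 1, 2) to the regression covariate.
CODINGS = {
    "additive":       {0: 0, 1: 1, 2: 2},  # each copy adds the same effect
    "dominant":       {0: 0, 1: 1, 2: 1},  # one copy is enough
    "recessive":      {0: 0, 1: 0, 2: 1},  # two copies are required
    "heterodominant": {0: 0, 1: 1, 2: 0},  # only heterozygotes differ
}


def encode(genotypes, model):
    """Recode allele counts under the chosen genetic model."""
    try:
        table = CODINGS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
    return [table[g] for g in genotypes]


if __name__ == "__main__":
    sample = [0, 1, 2, 1, 0, 2]
    for model in CODINGS:
        print(model, encode(sample, model))
```

Under the additive coding a truly recessive variant is modelled as if heterozygotes carried half the effect, which dilutes the signal. This is consistent with the abstract's observation that such variants need a larger sample to be detected additively.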


Increasing the resolution and precision of psychiatric GWAS by re-imputing summary statistics using a large, diverse reference panel

10.1101/496570 ◽

2018 ◽

Author(s):

Chris Chatzinakos ◽

Donghyung Lee ◽

Na Cai ◽

Vladimir I. Vladimirov ◽

Bradley T. Webb ◽

...

Keyword(s):

Large Scale ◽

Association Studies ◽

Genome Project ◽

Genotype Imputation ◽

Reference Panel ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Common Variants ◽

Post Traumatic Stress ◽

Mixed Ancestry

Abstract: Genotype imputation across populations of mixed ancestry is critical for optimal discovery in large-scale genome-wide association studies (GWAS). Methods for direct imputation of GWAS summary statistics were previously shown to be practically as accurate as summary statistics produced after raw genotype imputation, while incurring an orders-of-magnitude lower computational burden. Given that direct imputation needs a precise estimate of linkage disequilibrium (LD) and that most methods use a small reference panel (e.g., ~2,500 subjects from the 1000 Genomes Project), there is a great need for much larger and more diverse reference panels. To accurately estimate the LD needed for an exhaustive analysis of any cosmopolitan cohort, we developed DISTMIX2. DISTMIX2: i) uses a much larger and more diverse reference panel and ii) estimates weights of ethnic mixture based solely on Z-scores (when AFs are not available). We applied DISTMIX2 to GWAS summary statistics from the Psychiatric Genetic Consortium (PGC). DISTMIX2 uncovered signals in numerous new regions, with most of these findings coming from rarer variants. Rarer variants localize signals much more sharply than common variants, as the LD for rare variants extends over a shorter distance than for common ones. For example, while the original PGC post-traumatic stress disorder (PTSD) study found only 3 marginal signals for common variants, we now uncover a very strong signal for a rare variant in PKN2, a gene associated with neuronal and hippocampal development. Thus, DISTMIX2 provides a robust and fast (re)imputation approach for most psychiatric GWAS.


Whole-genome reference panel of 1,781 Northeast Asians improves imputation accuracy of rare and low-frequency variants

10.1101/600353 ◽

2019 ◽

Cited By ~ 1

Author(s):

Seong-Keun Yoo ◽

Chang-Uk Kim ◽

Hie Lim Kim ◽

Sungjae Kim ◽

Jong-Yeon Shin ◽

...

Keyword(s):

Imputation Accuracy ◽

Northeast Asia ◽

Low Frequency ◽

Genotype Imputation ◽

Reference Panel ◽

Whole Genome Sequencing Data ◽

Reference Database ◽

Whole Genome ◽

Sequencing Data ◽

Northeast Asian

Abstract: Genotype imputation using the reference panel is a cost-effective strategy to fill millions of missing genotypes for the purpose of various genetic analyses. Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1,781 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversities of Korean (n=850) and Mongolian (n=386) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for the Northeast Asian populations, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. Also, we illustrate that NARD can potentially improve disease variant discovery by reducing pathogenic candidates. Overall, this study provides a decent reference panel for the genetic studies in Northeast Asia.
