Leveraging TOPMed Imputation Server and Constructing a Cohort-Specific Imputation Reference Panel to Enhance Genotype Imputation among Cystic Fibrosis Patients

Author(s):  
Quan Sun ◽  
Weifang Liu ◽  
Jonathan D. Rosen ◽  
Le Huang ◽  
Rhonda G. Pace ◽  
...  
2021 ◽  

Cystic fibrosis (CF) is a severe genetic disorder that can cause multiple comorbidities affecting the lungs, the pancreas, the luminal digestive system and beyond. In our previous genome-wide association studies (GWAS), we genotyped ~8,000 CF samples using a mixture of different genotyping platforms. More recently, the Cystic Fibrosis Genome Project (CFGP) performed deep (~30x) whole genome sequencing (WGS) of 5,095 samples to better understand the genetic mechanisms underlying clinical heterogeneity among CF patients. For mixtures of GWAS array and WGS data, genotype imputation has proven effective in increasing effective sample size. Therefore, we first performed imputation for the ~8,000 CF samples with GWAS array genotype using the TOPMed freeze 8 reference panel. Our results demonstrate that TOPMed can provide high-quality imputation for CF patients, boosting genomic coverage from ~0.3 - 4.2 million genotyped markers to ~11 - 43 million well-imputed markers, and significantly improving Polygenic Risk Score (PRS) prediction accuracy. Furthermore, we built a CF-specific CFGP reference panel based on WGS data of CF patients. We demonstrate that despite having ~3% the sample size of TOPMed, our CFGP reference panel can still outperform TOPMed when imputing some CF disease-causing variants, likely due to allele and haplotype differences between CF patients and general populations. We anticipate our imputed data for 4,656 samples without WGS data will benefit our subsequent genetic association studies, and the CFGP reference panel built from CF WGS samples will benefit other investigators studying CF.


2021 ◽  
Author(s):  
Lei Zhang ◽  
Shan-Shan Yan ◽  
Jing-Jing Ni ◽  
Yu-Fang Pei

The large-scale, open-access whole-exome sequencing (WES) data of ~200,000 UK Biobank (UKB) participants is accelerating a new wave of genetic association studies aiming to identify rare and functional loss-of-function (LoF) variants associated with a broad range of complex traits and diseases; however, the community lacks stringent replication of new associations. In this study, we proposed to merge the WES genotypes and the genome-wide genotyping (GWAS) genotypes of 167,000 UKB Caucasian participants into a combined reference panel, and then to impute 241,911 UKB Caucasian participants who had GWAS genotypes only. We then proposed to use the imputed data to replicate associations identified in the discovery WES sample. Using a leave-100-out imputation strategy in the reference panel, we showed that the average imputation accuracy measure r2 is modest to high at LoF variants of all minor allele frequency (MAF) intervals, including ultra-rare ones: 0.942 at MAF interval [1%, 50%], 0.807 at [0.1%, 1.0%), 0.805 at [0.01%, 0.1%), 0.664 at [0.001%, 0.01%) and 0.410 at (0, 0.001%). As applications, we studied single-variant-level and gene-level associations of LoF variants with estimated heel BMD (eBMD) and 4 lipid traits: high-density-lipoprotein cholesterol (HDL-C), low-density-lipoprotein cholesterol (LDL-C), triglycerides (TG) and total cholesterol (TC). In addition to replicating dozens of previously reported genes such as MEPE for eBMD and PCSK9 for more than one lipid trait, the results also identified 2 novel gene-level associations: PLIN1 (cumulative MAF=0.10%, discovery BETA=0.38, P=1.20×10^-13; replication BETA=0.25, P=1.03×10^-6) and ANGPTL3 (cumulative MAF=0.10%, discovery BETA=−0.36, P=4.70×10^-11; replication BETA=−0.30, P=6.60×10^-11) for HDL-C, as well as one novel single-variant-level association (11:14843853:C:T, MAF=0.11%, discovery BETA=−0.31, P=2.70×10^-9; replication BETA=−0.31, P=8.80×10^-14, PDE3B) for TG. Our results highlighted the strength of WES-based genotype imputation and provided useful imputed data within the UKB cohort.
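
As an illustration of how an accuracy summary like the one above could be tabulated, the following minimal sketch computes the squared Pearson correlation (r2) between true genotypes and imputed dosages for each variant and averages it within MAF bins. The function names, the input layout, and the treatment of monomorphic variants are assumptions made for this example, not the authors' pipeline.

```python
from collections import defaultdict
from statistics import mean


def squared_pearson(x, y):
    """Squared Pearson correlation; None if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # r2 is undefined for a monomorphic variant
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def maf_bin(maf):
    """Map a MAF (as a fraction) to the intervals quoted in the abstract."""
    if maf >= 0.01:
        return "[1%, 50%]"
    if maf >= 0.001:
        return "[0.1%, 1%)"
    if maf >= 0.0001:
        return "[0.01%, 0.1%)"
    if maf >= 0.00001:
        return "[0.001%, 0.01%)"
    return "(0, 0.001%)"


def mean_r2_by_bin(variants):
    """variants: iterable of (maf, true_genotypes, imputed_dosages).

    Returns {bin label: mean r2}, skipping variants with undefined r2.
    """
    per_bin = defaultdict(list)
    for maf, truth, dosage in variants:
        r2 = squared_pearson(truth, dosage)
        if r2 is not None:
            per_bin[maf_bin(maf)].append(r2)
    return {label: mean(values) for label, values in per_bin.items()}
```

In a leave-100-out design, `truth` would hold the masked sequencing genotypes of the held-out samples and `dosage` their imputed values; how the held-out batches are pooled is left open here.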


PLoS Genetics ◽  
2020 ◽  
Vol 16 (11) ◽  
pp. e1009049
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100-fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows-Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
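
To make the PBWT idea concrete, here is a minimal sketch of the basic positional prefix ordering: after each site, haplotypes are kept sorted by their reversed prefixes, so haplotypes sharing a long match ending at that site sit next to each other. This shows only the generic ordering step; the helper names and the fixed-width neighbour lookup are assumptions for illustration and do not reproduce IMPUTE5's actual selection procedure.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the positional prefix array after each site.

    haplotypes: list of equal-length lists of 0/1 alleles.
    arrays[k] orders haplotype indices by the reversed prefix hap[:k],
    so long matches ending at site k-1 are adjacent in the ordering.
    """
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    arrays = [order]
    for k in range(n_sites):
        # Stable partition on the allele at site k keeps earlier ordering
        # as the tie-breaker, which yields the reversed-prefix sort.
        zeros = [h for h in order if haplotypes[h][k] == 0]
        ones = [h for h in order if haplotypes[h][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


def neighbours(order, target, width=1):
    """Haplotypes within `width` positions of `target` in a prefix array."""
    pos = order.index(target)
    lo = max(0, pos - width)
    hi = min(len(order), pos + width + 1)
    return [h for h in order[lo:hi] if h != target]


if __name__ == "__main__":
    panel = [[0, 1, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
    final_order = pbwt_prefix_arrays(panel)[-1]
    print(final_order, neighbours(final_order, 0))
```

Each site costs time linear in the number of haplotypes, which is what makes this ordering attractive as a way to shortlist candidate conditioning haplotypes in very large panels.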


2015 ◽  
Author(s):  
Shane McCarthy ◽  
Sayantan Das ◽  
Warren Kretzschmar ◽  
Olivier Delaneau ◽  
Andrew R. Wood ◽  
...  

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.


2017 ◽  
Author(s):  
Sina Rüeger ◽  
Aaron McDaid ◽  
Zoltán Kutalik

Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis to higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, cohorts of such a meta-analysis are variable in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical. Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data we find that accounting for ancestry admixture yields noticeable improvement only when the total reference panel size is > 1000. We show that for population-specific variants this effect is more pronounced with increasing FST.
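
For orientation, conventional summary statistics imputation is usually written as a conditional expectation under a multivariate normal model. The notation below is an assumption introduced for this note (it is not taken from the abstract), and it shows the standard estimator rather than the variance matching extension: z_t are the Z-statistics at typed variants, C is the LD (correlation) matrix among typed variants estimated from a reference panel, c is the vector of LD between the untyped variant u and the typed variants, and λ is a small regularization constant.

```latex
\hat{z}_u = c^{\top} \left( C + \lambda I \right)^{-1} z_t
```

The assumption flagged in the abstract is visible here: the estimator is only appropriate if C and c, taken from an external panel, match the covariance of the Z-statistics in the meta-analysed cohorts.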


2021 ◽  
Author(s):  
Zhen Wang ◽  
Zhenyang Zhang ◽  
Zitao Chen ◽  
Jiabao Sun ◽  
Caiyun Cao ◽  
...  

Pigs not only function as a major meat source worldwide but are also commonly used as an animal model for studying human complex traits. A large haplotype reference panel has been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using the imputed genotypes in the downstream analysis, such as GWASs, TWASs, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is currently still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability for pig commercial 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate value of 0.971 in 81 Large White pigs (~17x sequencing coverage). We also applied our reference panel to impute the low-density SNP chip into the high-density data for three GWASs and found novel significantly associated SNPs that might be causal variants.


animal ◽  
2019 ◽  
Vol 13 (6) ◽  
pp. 1119-1126 ◽  
Author(s):  
S. Ye ◽  
X. Yuan ◽  
S. Huang ◽  
H. Zhang ◽  
Z. Chen ◽  
...  

2019 ◽  
Vol 21 (5) ◽  
pp. 1806-1817 ◽  
Author(s):  
Wei-Yang Bai ◽  
Xiao-Wei Zhu ◽  
Pei-Kuan Cong ◽  
Xue-Jun Zhang ◽  
J Brent Richards ◽  
...  

Here, 622 imputations were conducted with 394 customized reference panels for Han Chinese and European populations. Besides validating the fact that imputation accuracy could always benefit from increased panel size when the reference panel was population specific, the results yielded two new insights. First, when the haplotype size of the reference panel was fixed, the imputation accuracy of common and low-frequency variants (Minor Allele Frequency (MAF) > 0.5%) decreased as the population diversity of the reference panel increased, but for rare variants (MAF < 0.5%), a small fraction of diversity in the panel could improve imputation accuracy. Second, when the haplotype size of the reference panel was increased with extra population-diverse samples, the imputation accuracy of common variants (MAF > 5%) for the European population could always benefit from the expanding sample size. However, for the Han Chinese population, the accuracy of all imputed variants was highest when the reference panel contained a fraction of extra diverse samples (8–21%). In addition, we evaluated the imputation performance of existing reference panels, such as the Haplotype Reference Consortium (HRC), 1000 Genomes Project Phase 3 and the China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology (CONVERGE). For the European population, the HRC panel showed the best performance in our analysis. For the Han Chinese population, we proposed an optimum imputation reference panel constituent ratio for researchers who would like to customize their own sequenced reference panel, but a high-quality and large-scale Chinese reference panel was still needed. Our findings could be generalized to other populations with conservative genomes; a tool was provided to investigate other populations of interest (https://github.com/Abyss-bai/reference-panel-reconstruction).


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Marta Guindo-Martínez ◽  
Ramon Amela ◽  
Silvia Bonàs-Guarch ◽  
Montserrat Puiggròs ◽  
...  

Genome-wide association studies (GWAS) are not fully comprehensive, as current strategies typically test only the additive model, exclude the X chromosome, and use only one reference panel for genotype imputation. We implement an extensive GWAS strategy, GUIDANCE, which improves genotype imputation by using multiple reference panels and includes the analysis of the X chromosome and non-additive models to test for association. We apply this methodology to 62,281 subjects across 22 age-related diseases and identify 94 genome-wide associated loci, including 26 previously unreported. Moreover, we observe that 27.7% of the 94 loci are missed if we use standard imputation strategies with a single reference panel, such as HRC, and only test the additive model. Among the new findings, we identify three novel low-frequency recessive variants with odds ratios larger than 4, which need at least a three-fold larger sample size to be detected under the additive model. This study highlights the benefits of applying innovative strategies to better uncover the genetic architecture of complex diseases.
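
The difference between additive and non-additive tests comes down to how the genotype is coded before it enters the regression. The sketch below shows the usual textbook codings for three inheritance models; the function name and the set of models are assumptions for illustration and are not taken from the GUIDANCE implementation.

```python
def encode_genotype(alt_allele_count, model):
    """Code a biallelic genotype (0, 1 or 2 alternate alleles) for a model.

    additive : 0, 1, 2  (each extra allele has the same effect)
    dominant : 0, 1, 1  (one copy is enough for the full effect)
    recessive: 0, 0, 1  (only two copies have an effect)
    """
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("alt_allele_count must be 0, 1 or 2")
    if model == "additive":
        return alt_allele_count
    if model == "dominant":
        return int(alt_allele_count >= 1)
    if model == "recessive":
        return int(alt_allele_count == 2)
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for model in ("additive", "dominant", "recessive"):
        print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

Under the recessive coding only homozygous carriers differ from everyone else, which is why a truly recessive low-frequency variant can be hard to detect when only the additive coding is tested.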

