High-depth whole genome sequencing of a large population-specific reference panel: Enhancing sensitivity, accuracy, and imputation

2017 ◽  
Author(s):  
Todd Lencz ◽  
Jin Yu ◽  
Cameron Palmer ◽  
Shai Carmi ◽  
Danny Ben-Avraham ◽  
...  

Abstract Background: While increasingly large reference panels for genome-wide imputation have been recently made available, the degree to which imputation accuracy can be enhanced by population-specific reference panels remains an open question. In the present study, we sequenced at full-depth (≥30x) a moderately large (n=738) cohort of samples drawn from the Ashkenazi Jewish population across two platforms (Illumina X Ten and Complete Genomics, Inc.). We developed and refined a series of quality control steps to optimize sensitivity, specificity, and comprehensiveness of variant calls in the reference panel, and then tested the accuracy of imputation against target cohorts drawn from the same population. Results: For samples sequenced on the Illumina X Ten platform, quality thresholds were identified that permitted highly accurate calling of single nucleotide variants across 94% of the genome. The Complete Genomics, Inc. platform was more conservative (fewer variants called) compared to the Illumina platform, but also demonstrated relatively greater numbers of false positives that needed to be filtered. Quality control procedures also permitted detection of novel genome reads that are not mapped to current reference or alternate assemblies. After stringent quality control, the population-specific reference panel produced more accurate and comprehensive imputation results relative to publicly available, large cosmopolitan reference panels. The population-specific reference panel also permitted enhanced filtering of clinically irrelevant variants from personal genomes. Conclusions: Our primary results demonstrate enhanced accuracy of a population-specific imputation panel relative to cosmopolitan panels, especially in the range of infrequent (<5% non-reference allele frequency) and rare (<1% non-reference allele frequency) variants that may be most critical to further progress in mapping of complex phenotypes.
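The abstract does not say which accuracy metric was used. As a minimal sketch, assuming accuracy is scored as the squared Pearson correlation between true allele counts and imputed dosages (a common choice, not confirmed by the source), per-variant accuracy and the frequency classes named above could be computed as follows. The function names and the "common" label are hypothetical, and "infrequent" is treated here as the 1% to 5% band so that the bins do not overlap.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true allele counts (0/1/2)
    and imputed dosages (floats in [0, 2]) at a single variant."""
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return None  # correlation is undefined when either side is constant
    return cov * cov / (var_t * var_d)


def frequency_bin(non_ref_af):
    """Assign a variant to a frequency class using the abstract's cut-offs."""
    if non_ref_af < 0.01:
        return "rare"
    if non_ref_af < 0.05:
        return "infrequent"  # assumption: the 1% to 5% band
    return "common"          # assumption: label not used in the abstract
```

Averaging `dosage_r2` within each `frequency_bin` for two reference panels would give the kind of per-frequency comparison the abstract describes.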


2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Meraj Ahmad ◽  
Anubhav Sinha ◽  
Sreya Ghosh ◽  
Vikrant Kumar ◽  
Sonia Davila ◽  
...  


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye

Background: Asia is the largest continent in the world, with a large group of populations. However, we still lack an imputation server with an Asian-specific reference panel to estimate genotypes for genome-wide association studies in Asia. Currently, two well-known imputation servers are available, i.e., the Michigan imputation server in the US and the Sanger imputation server in the UK. However, the quality of imputation for Southeast Asian populations is unsatisfactory when using their genotype imputation services and reference panels. Objective: In this paper, we develop the ModStore imputation server with a specially designed reference panel to offer genotype imputation as a service, aiming to increase the power of genome-wide association studies in Singapore in the context of National Precision Medicine. Method: We present the implementation and customization of the ModStore imputation server on high performance computing infrastructure. We also construct a reference panel based on whole-genome sequencing of Singaporeans, referred to as the SG10K reference panel, to improve the imputation accuracy for Southeast Asian populations. Results: Experimental results show that, using the SG10K reference panel, an improvement of over 79% in mean Rsq can be achieved for the imputation of data sets from three Singapore ethnic populations, i.e., Malay, Chinese, and Indian, under MAF<0.005, compared to the 1000 Genomes reference panel. Conclusion: With the ModStore imputation server, genotype imputation can be performed more accurately for data derived from array-based pharmacogenomics and pre-existing Southeast Asia's population-scale genetic.



2018 ◽  
Author(s):  
Saurabh Belsare ◽  
Michal Sakin-Levy ◽  
Yulia Mostovoy ◽  
Steffen Durinck ◽  
Subhra Chaudhry ◽  
...  

ABSTRACT Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotype, phasing, and imputation accuracy data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomics platform. We observe that phasing and imputation for rare variants are unreliable, which likely reflects the limited sample size of the 1000 Genomes project data. Further, it appears that using a population specific reference panel does not improve the accuracy of imputation over using the entire 1000 Genomes data set as a reference panel. We also note that the error rates and trends depend on the choice of definition of error, and hence any error reporting needs to take these definitions into account.



2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Jarmo Ritari ◽  
Kati Hyvärinen ◽  
Jonna Clancy ◽  
Jukka Partanen ◽  
Satu Koskela ◽  
...  

Abstract The HLA genes, the most polymorphic genes in the human genome, constitute the strongest single genetic susceptibility factor for autoimmune diseases, transplantation alloimmunity and infections. HLA imputation via statistical inference of alleles based on single-nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with alleles is a powerful first-step screening tool. Due to different LD structures between populations, the accuracy of HLA imputation may benefit from matching the imputation reference with the study population. To evaluate the potential advantage of using population-specific reference in HLA imputation, we constructed an HLA reference panel consisting of 1150 Finns with 5365 major histocompatibility complex region SNPs consistent between genome builds. We evaluated the accuracy of the panel against a European panel in an independent test set of 213 Finnish subjects. We show that the Finnish panel yields a lower imputation error rate (1.24% versus 1.79%). More than 30% of imputation errors occurred in haplotypes enriched in Finland. The frequencies of imputed HLA alleles were highly correlated with clinical-grade HLA allele frequencies and allowed accurate replication of established HLA–disease associations in ∼102 000 biobank participants. The results show that a population-specific reference increases imputation accuracy in a relatively isolated population within Europe and can be successfully applied to biobank-scale genome data collections.



2021 ◽  
Author(s):  
Su Wang ◽  
Miran Kim ◽  
Xiaoqian Jiang ◽  
Arif Ozgun Harmanci

The decreasing cost of DNA sequencing has led to a great increase in our knowledge about genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with large population-specific reference panels and estimate the genotypes of untyped variants by making use of linkage disequilibrium patterns. The most accurate imputation methods are based on the Li-Stephens hidden Markov model (HMM), which treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel. Here we assess the accuracy of local-HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation has recently been used by machine learning-based genotype imputation approaches. We assess how the parameters of the local-HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that local-HMMs can accurately impute common and uncommon variants and can be relaxed to impute rare variants as well. The source code for the local HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer.
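To make the local-HMM idea concrete, here is a minimal sketch of window-restricted Li-Stephens imputation for a single haploid target. It is not the LoHaMMer implementation. It assumes a uniform switch probability between adjacent sites instead of genetic-map distances, measures the window in sites rather than centimorgans, and uses hypothetical names (`impute_local`, `switch`, `err`).

```python
def impute_local(ref_haps, target, site, window=2, switch=0.05, err=0.01):
    """Posterior probability that `target` carries allele 1 at `site`,
    using only the sites within `window` positions of it.

    ref_haps: list of K reference haplotypes, each a list of 0/1 alleles.
    target:   list of 0/1 alleles, with None at untyped sites.
    """
    lo = max(0, site - window)
    hi = min(len(target), site + window + 1)
    k = len(ref_haps)

    def emit(h, j):
        # Untyped sites carry no information about the copied haplotype.
        if target[j] is None:
            return 1.0
        return 1.0 - err if ref_haps[h][j] == target[j] else err

    def step(prev):
        # Li-Stephens transition: keep copying the same haplotype with
        # probability (1 - switch), otherwise jump to a uniformly chosen one.
        total = sum(prev)
        return [(1.0 - switch) * p + switch * total / k for p in prev]

    # Forward pass over the window.
    fwd = [[emit(h, lo) / k for h in range(k)]]
    for j in range(lo + 1, hi):
        moved = step(fwd[-1])
        fwd.append([moved[h] * emit(h, j) for h in range(k)])

    # Backward pass; the transition matrix is symmetric, so `step` is reused.
    bwd = [[1.0] * k]
    for j in range(hi - 1, lo, -1):
        weighted = [bwd[0][h] * emit(h, j) for h in range(k)]
        bwd.insert(0, step(weighted))

    idx = site - lo
    post = [fwd[idx][h] * bwd[idx][h] for h in range(k)]
    carriers = sum(post[h] for h in range(k) if ref_haps[h][site] == 1)
    return carriers / sum(post)
```

For long windows the forward and backward values would need rescaling to avoid underflow; that is omitted here because the windows in this sketch span only a few sites.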



2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Wendell Jones ◽  
Binsheng Gong ◽  
Natalia Novoradovskaya ◽  
Dan Li ◽  
Rebecca Kusko ◽  
...  

Abstract Background: Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. Results: In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency with more than 25,000 variants having less than 20% allele frequency with 1653 variants in COSMIC-related genes. This is 5–100× more than existing commercially available samples. We also identify an unprecedented number of negative positions in coding regions, allowing statistical rigor in assessing limit-of-detection, sensitivity, and precision. Over 300 loci are randomly selected and independently verified via droplet digital PCR with 100% concordance. Agilent normal reference Sample B can be admixed with Sample A to create new samples with a similar number of known variants at much lower allele frequency than what exists in Sample A natively, including known variants having allele frequency of 0.02%, a range suitable for assessing liquid biopsy panels. Conclusion: These new reference samples and their admixtures provide superior capability for performing oncopanel quality control, analytical accuracy, and validation for small to large oncopanels and liquid biopsy assays.
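The admixture step comes down to simple dilution arithmetic. The sketch below assumes that a variant is absent from Sample B and that both samples contribute genome copies in proportion to their mixing fraction. The function name and the 2% mixing fraction are hypothetical and are not taken from the abstract.

```python
def admixed_vaf(vaf_in_a, fraction_a):
    """Expected variant allele frequency after mixing Sample A into a
    variant-free background, where `fraction_a` is Sample A's share of
    the mixture (both arguments are fractions between 0 and 1)."""
    if not 0.0 <= vaf_in_a <= 1.0 or not 0.0 <= fraction_a <= 1.0:
        raise ValueError("both arguments must lie between 0 and 1")
    return vaf_in_a * fraction_a


# A variant at 1% in Sample A, with Sample A making up 2% of the mixture,
# is expected at roughly 0.02% in the admixed sample.
expected = admixed_vaf(0.01, 0.02)
```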



2020 ◽  
Vol 98 (Supplement_3) ◽  
pp. 25-25
Author(s):  
Austin M Putz ◽  
Patrick Charagu ◽  
Abe Huisman

Abstract Two commonly used population structure software packages are freely available for breed authentication, Structure and Admixture. Structure uses a Bayesian approach to model population structure, while Admixture uses a frequentist approach. More recently, an allele frequency method has been updated to use quadratic programming to constrain the multiple linear regression coefficients of the regression of genotype count (divided by two) on the matrix of allele frequencies for each known breed or line. This constraint forced coefficients to sum to one and be greater than or equal to 0 and less than or equal to 1. The goal of this research was to compare and contrast these three methods to determine the breed/line authenticity for each of the five genetic lines. These five lines included Large White, Landrace, a lean Duroc, a meat quality Duroc, and a Pietrain line. Only animals with a 50K SNP panel were used in this analysis. Analyses were run five times for Structure and Admixture to check repeatability. The allele frequency method did not need to be repeated because it remains the same as long as the reference allele frequency matrix stays constant. For Structure, results of breed composition were inconsistent across replicates. Structure separated at least one of the maternal lines in three out of the five replicates with only 500 animals and kept the Duroc lines together as one population. Only 500 animals could be utilized in each run of Structure due to computational constraints. Admixture was very consistent across runs for each animal, but also failed to separate the two Duroc lines, instead splitting one of the two maternal lines. Finally, the allele frequency method split all five lines correctly and was 100% reproducible as long as the reference allele frequency matrix stays the same across runs.
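The constrained regression behind the allele frequency method can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it replaces a quadratic programming solver with projected gradient descent onto the probability simplex, which enforces the same constraints (coefficients between 0 and 1 that sum to one). The function names and the iteration count are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_i >= 0, sum(b) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def breed_composition(half_counts, breed_freqs, iterations=2000):
    """Least-squares breed proportions for one animal.

    half_counts: genotype count divided by two at each SNP.
    breed_freqs: one row per SNP, one allele frequency per breed or line.
    """
    n_breeds = len(breed_freqs[0])
    # Step size 1 / (2 * squared Frobenius norm) is a safe bound on the
    # inverse Lipschitz constant of the gradient.
    frob = sum(x * x for row in breed_freqs for x in row)
    step = 1.0 / (2.0 * frob)
    beta = [1.0 / n_breeds] * n_breeds
    for _ in range(iterations):
        residuals = [sum(f * b for f, b in zip(row, beta)) - y
                     for row, y in zip(breed_freqs, half_counts)]
        gradient = [2.0 * sum(r * row[j] for r, row in zip(residuals, breed_freqs))
                    for j in range(n_breeds)]
        beta = project_to_simplex([b - step * g for b, g in zip(beta, gradient)])
    return beta
```

Because the objective is convex and the starting point is fixed, repeated runs on the same reference allele frequency matrix return the same answer, which is consistent with the reproducibility the abstract reports for this method.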



2020 ◽  
Author(s):  
Celine Charon ◽  
Rodrigue Allodji ◽  
Vincent Meyer ◽  
Jean-François Deleuze

Abstract Quality control methods for genome-wide association studies and fine mapping are commonly used for imputation; however, they result in the loss of many single nucleotide polymorphisms (SNPs). To investigate the consequences of filtration on imputation, we studied the direct effects on the number of markers, their allele frequencies, imputation quality scores and post-filtration events. We pre-phased 1,031 genotyped individuals from diverse ethnicities and compared the imputed variants to 1,089 NCBI recorded individuals for additional validation. Without variant pre-filtration based on quality control (QC), we observed no impairment in the imputation of SNPs that failed QC, whereas with pre-filtration there was an overall loss of information. Significant differences between frequencies with and without pre-filtration were found only in the range of very rare (5E-04-1E-03) and rare variants (1E-03-5E-03) (p < 1E-04). Increasing the post-filtration imputation quality score from 0.3 to 0.8 reduced the number of single nucleotide variants (SNVs) <0.001 2.5 fold with or without QC pre-filtration and halved the number of very rare variants (5E-04). As a result, to maintain confidence and enough SNVs, we propose here a 2-step post-filtration approach to increase the number of very rare and rare variants compared to conservative post-filtration methods.



2020 ◽  
Vol 6 (22) ◽  
pp. eaaz7835 ◽  
Author(s):  
Sungwon Jeon ◽  
Youngjune Bhak ◽  
Yeonsong Choi ◽  
Yeonsu Jeon ◽  
Seunghoon Kim ◽  
...  

We present the initial phase of the Korean Genome Project (Korea1K), including 1094 whole genomes (sequenced at an average depth of 31×), along with data of 79 quantitative clinical traits. We identified 39 million single-nucleotide variants and indels of which half were singleton or doubleton and detected Korean-specific patterns based on several types of genomic variations. A genome-wide association study illustrated the power of whole-genome sequences for analyzing clinical traits, identifying nine more significant candidate alleles than previously reported from the same linkage disequilibrium blocks. Also, Korea1K, as a reference, showed better imputation accuracy for Koreans than the 1KGP panel. As proof of utility, germline variants in cancer samples could be filtered out more effectively when the Korea1K variome was used as a panel of normals compared to non-Korean variome sets. Overall, this study shows that Korea1K can be a useful genotypic and phenotypic resource for clinical and ethnogenetic studies.


