Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel

2014 ◽  
Vol 5 (1) ◽  
Author(s):  
Olivier Delaneau ◽  
◽  
Jonathan Marchini
2016 ◽  
Author(s):  
G. David Poznik

Abstract: We have developed an algorithm to rapidly and accurately identify the Y-chromosome haplogroup of each male in a sample of one to millions. The algorithm, implemented in the yHaplo software package, does not rely on any particular genotyping modality or platform. Full sequences yield the most granular haplogroup classifications, but genotyping arrays can yield reliable calls, provided a reasonable number of phylogenetically informative variants has been assayed. The algorithm is robust to missing data, genotype errors, mutation recurrence, and other complications. We have tested the software on full sequences from phase 3 of the 1000 Genomes Project and on subsets thereof constructed by downsampling to SNPs present on each of four genotyping arrays. We have also run the software on array data from more than 600,000 males.


2018 ◽  
Author(s):  
Saurabh Belsare ◽  
Michal Sakin-Levy ◽  
Yulia Mostovoy ◽  
Steffen Durinck ◽  
Subhra Chaudhry ◽  
...  

Abstract: Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotype, phasing, and imputation accuracy data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomics platform. We observe that phasing and imputation for rare variants are unreliable, which likely reflects the limited sample size of the 1000 Genomes project data. Further, it appears that using a population specific reference panel does not improve the accuracy of imputation over using the entire 1000 Genomes data set as a reference panel. We also note that the error rates and trends depend on the choice of definition of error, and hence any error reporting needs to take these definitions into account.


2015 ◽  
Vol 6 (1) ◽  
Author(s):  
María Soler Artigas ◽  
◽  
Louise V. Wain ◽  
Suzanne Miller ◽  
Abdul Kader Kheirallah ◽  
...  

2019 ◽  
Author(s):  
Madeline H. Kowalski ◽  
Huijun Qian ◽  
Ziyi Hou ◽  
Jonathan D. Rosen ◽  
Amanda L. Tapia ◽  
...  

Abstract: Most genome-wide association and fine-mapping studies to date have been conducted in individuals of European descent, and genetic studies of populations of Hispanic/Latino and African ancestry are still limited. In addition to the limited inclusion of these populations in genetic studies, these populations have more complex linkage disequilibrium structure that may reduce the number of variants associated with a phenotype. In order to better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with commercial genome-wide genotyping array data. We demonstrate that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhances gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3 to 6.1-fold increase in the number of well-imputed variants, with 11-34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels, respectively. Impressively, even for extremely rare variants with sample minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%. Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~20,000 self-identified African descent individuals and ~23,000 self-identified Hispanic/Latino individuals identified associations with two rare variants in the HBB gene (rs33930165 with higher WBC (p = 8.1×10⁻¹²) in African populations, rs11549407 with lower HGB (p = 1.59×10⁻¹²) and HCT (p = 1.13×10⁻⁹) in Hispanics/Latinos). By comparison, neither variant would have been genome-wide significant if either 1000 Genomes Project Phase 3 or Haplotype Reference Consortium reference panels had been used for imputation. Our findings highlight the utility of TOPMed imputation reference panel for identification of novel associations between rare variants and complex traits not previously detected in similar sized genome-wide studies of under-represented African and Hispanic/Latino populations.

Author summary: Admixed African and Hispanic/Latino populations remain understudied in genome-wide association and fine-mapping studies of complex diseases. These populations have more complex linkage disequilibrium (LD) structure that can impair mapping of variants associated with complex diseases and their risk factors. Genotype imputation represents an approach to improve genome coverage, especially for rare or ancestry-specific variation; however, these understudied populations also have smaller relevant imputation reference panels that need to be expanded to represent their more complex LD patterns. In this study, we leveraged >100,000 phased sequences generated from the multi-ethnic NHLBI TOPMed project to impute in admixed cohorts encompassing ~20,000 individuals of African ancestry (AAs) and ~23,000 Hispanics/Latinos. We demonstrated substantially higher imputation quality for low frequency and rare variants in comparison to the state-of-the-art reference panels (1000 Genomes Project and Haplotype Reference Consortium). Association analyses of ~35 million (AAs) and ~27 million (Hispanics/Latinos) variants passing stringent post-imputation filtering with quantitative hematological traits led to the discovery of associations with two rare variants in the HBB gene; one of these variants was replicated in an independent sample, and the other is known to cause anemia in the homozygous state. By comparison, the same HBB variants would not have been genome-wide significant using other state-of-the-art reference panels due to lower imputation quality. Our findings demonstrate the power of the TOPMed whole genome sequencing data for imputation and subsequent association analysis in admixed African and Hispanic/Latino populations.


2021 ◽  
Vol 11 (3) ◽  
pp. 231
Author(s):  
Faven Butler ◽  
Ali Alghubayshi ◽  
Youssef Roman

Gout is an inflammatory condition caused by elevated serum urate (SU), a condition known as hyperuricemia (HU). Genetic variations, including single nucleotide polymorphisms (SNPs), can alter the function of urate transporters, leading to differential HU and gout prevalence across different populations. In the United States (U.S.), gout prevalence differentially affects certain racial groups. The objective of this proposed analysis is to compare the frequency of urate-related genetic risk alleles between Europeans (EUR) and the following major racial groups: Africans in Southwest U.S. (ASW), Han-Chinese (CHS), Japanese (JPT), and Mexican (MXL) from the 1000 Genomes Project. The Ensembl genome browser of the 1000 Genomes Project was used to conduct cross-population allele frequency comparisons of 11 SNPs across 11 genes, physiologically involved and significantly associated with SU levels and gout risk. Gene/SNP pairs included: ABCG2 (rs2231142), SLC2A9 (rs734553), SLC17A1 (rs1183201), SLC16A9 (rs1171614), GCKR (rs1260326), SLC22A11 (rs2078267), SLC22A12 (rs505802), INHBC (rs3741414), RREB1 (rs675209), PDZK1 (rs12129861), and NRXN2 (rs478607). Allele frequencies were compared to EUR using the Chi-Square or Fisher's Exact test, as appropriate. Bonferroni correction for multiple comparisons was used, with p < 0.0045 for statistical significance. Risk alleles were defined as the allele that is associated with baseline or higher HU and gout risks. The cumulative HU or gout risk allele index of the 11 SNPs was estimated for each population. The prevalence of HU and gout in U.S. and non-U.S. populations was evaluated using published epidemiological data and literature review. Compared with EUR, the SNP frequencies of 7/11 in ASW, 9/11 in MXL, 9/11 in JPT, and 11/11 in CHS were significantly different. HU or gout risk allele indices were 5, 6, 9, and 11 in ASW, MXL, CHS, and JPT, respectively. Out of the 11 SNPs, the percentage of risk alleles in CHS and JPT was 100%. Compared to non-U.S. populations, the prevalence of HU and gout appears to be higher in western world countries. Compared with EUR, CHS and JPT populations had the highest HU or gout risk allele frequencies, followed by MXL and ASW. These results suggest that individuals of Asian descent are at higher HU and gout risk, which may partly explain the nearly three-fold higher gout prevalence among Asians versus Caucasians in ambulatory care settings. Furthermore, gout remains a disease of developed countries with a marked global rise.
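
The significance threshold quoted in the abstract above (p < 0.0045 for 11 SNP comparisons) is consistent with a Bonferroni correction of a conventional family-wise level of 0.05. The abstract does not state the family-wise level, so the 0.05 below is an assumption, and the function names are illustrative only. A minimal sketch of the arithmetic:

```python
# Bonferroni threshold for 11 comparisons (one per SNP in the abstract).
# ASSUMPTION: a family-wise alpha of 0.05; the abstract reports only the
# resulting threshold (p < 0.0045), not the starting alpha.
FAMILY_WISE_ALPHA = 0.05
N_TESTS = 11


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float,
                   alpha: float = FAMILY_WISE_ALPHA,
                   n_tests: int = N_TESTS) -> bool:
    """Compare a single p-value against the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, N_TESTS)
print(f"per-test threshold: {threshold:.4f}")  # 0.05 / 11, shown to 4 decimals
```

Under this assumption, a hypothetical comparison with p = 0.001 would pass the adjusted threshold, while one with p = 0.01 would not, even though it falls below the unadjusted 0.05.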


2014 ◽  
Vol 6 (4) ◽  
pp. 846-860 ◽  
Author(s):  
Gabriel Santpere ◽  
Fleur Darre ◽  
Soledad Blanco ◽  
Antonio Alcami ◽  
Pablo Villoslada ◽  
...  

2015 ◽  
Vol 32 (9) ◽  
pp. 1366-1372 ◽  
Author(s):  
Dmitry Prokopenko ◽  
Julian Hecker ◽  
Edwin K. Silverman ◽  
Marcello Pagano ◽  
Markus M. Nöthen ◽  
...  

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254363
Author(s):  
Aji John ◽  
Kathleen Muenzen ◽  
Kristiina Ausmees

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.


2014 ◽  
Author(s):  
Melinda A Yang ◽  
Kelley Harris ◽  
Montgomery Slatkin

We introduce a method for comparing a test genome with numerous genomes from a reference population. Sites in the test genome are given a weight w that depends on the allele frequency x in the reference population. The projection of the test genome onto the reference population is the average weight for each x, w(x). The weight is assigned in such a way that if the test genome is a random sample from the reference population, w(x)=1. Using analytic theory, numerical analysis, and simulations, we show how the projection depends on the time of population splitting, the history of admixture and changes in past population size. The projection is sensitive to small amounts of past admixture, the direction of admixture and admixture from a population not sampled (a ghost population). We compute the projection of several human and two archaic genomes onto three reference populations from the 1000 Genomes project, Europeans (CEU), Han Chinese (CHB) and Yoruba (YRI) and discuss the consistency of our analysis with previously published results for European and Yoruba demographic history. Including higher amounts of admixture between Europeans and Yoruba soon after their separation and low amounts of admixture more recently can resolve discrepancies between the projections and demographic inferences from some previous studies.
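
The normalization property stated in the abstract above (w(x) = 1 when the test genome is a random sample from the reference population) can be illustrated with a small simulation. The abstract does not give the weighting formula, so the sketch below uses one hypothetical choice that satisfies the property: a site at which the test chromosome carries the derived allele gets weight 1/x, and any other site gets weight 0. A single haploid test chromosome, the function names, and the simulation sizes are all assumptions made for illustration, not the authors' implementation.

```python
import random


def weight(carries_derived: bool, x: float) -> float:
    """Hypothetical site weight: 1/x if the test chromosome carries the
    derived allele, 0 otherwise (an assumed form, not from the abstract)."""
    return 1.0 / x if carries_derived else 0.0


def simulated_projection(x: float, n_sites: int, rng: random.Random) -> float:
    """Average weight over n_sites sites whose derived allele has
    frequency x in the reference population, when the test chromosome
    is itself a random draw from that population."""
    total = 0.0
    for _ in range(n_sites):
        # A random draw from the reference carries the derived allele
        # with probability x.
        carries_derived = rng.random() < x
        total += weight(carries_derived, x)
    return total / n_sites


rng = random.Random(0)  # fixed seed so the run is reproducible
projection = {
    x: simulated_projection(x, 20_000, rng)
    for x in (0.1, 0.3, 0.5, 0.7, 0.9)
}
for x, w in projection.items():
    print(f"x = {x:.1f}  w(x) = {w:.3f}")
```

With this weight the expected value at frequency x is x · (1/x) = 1, so each simulated w(x) should scatter around 1. Departures from 1 in real data are what the abstract attributes to population splitting, admixture, and past changes in population size.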

