Open-Access STRs Database Of Populations From The 1000 Genomes Project Using High Coverage Phase 3 Data

10.1101/2021.09.06.459168 ◽

2021 ◽

Author(s):

Tamara Soledad Frontanilla ◽

Guilherme Valle Silva ◽

Jesus Ayala ◽

Celso Teixeira Mendes

Keyword(s):

Open Access ◽

Forensic Genetics ◽

Allele Frequencies ◽

Equilibrium Analysis ◽

1000 Genomes Project ◽

Phase 3 ◽

Principal Coordinates Analysis ◽

High Coverage ◽

1000 Genomes ◽

Principal Coordinates

Accurate STR genotyping from next-generation sequencing (NGS) data has been challenging. Haplotype inference and phasing for STRs (HipSTR) was specifically developed to deal with genotyping errors and obtain reliable STR genotypes from whole-genome sequencing datasets. The objective of this investigation was to perform a comprehensive genotyping analysis of a set of STRs of broad forensic interest from the 1000 Genomes populations and release a reliable open-access STR database to the forensic genetics community. A set of 22 STR markers was analyzed using the CRAM files of the 1000 Genomes Project Phase 3 high-coverage (30x) dataset generated by the New York Genome Center (NYGC). HipSTR was used to call genotypes from 2,504 samples from 26 populations organized into five groups: African, East Asian, European, South Asian, and admixed American. The D21S11 marker could not be detected in the present study. Moreover, the Hardy-Weinberg equilibrium analysis, coupled with a comprehensive analysis of allele frequencies, revealed that HipSTR could not identify longer Penta E (and, to a lesser extent, Penta D) alleles. This issue is probably due to the limited length of sequencing reads available for genotype calling, resulting in heterozygote deficiency. Notwithstanding that, AMOVA, a clustering analysis using STRUCTURE, and a Principal Coordinates Analysis revealed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium (AFR, EUR, EAS, SAS). Meanwhile, the AMOVA results corroborated previous reports that most of the variance (97.12%) is observed within populations. This set of analyses revealed that, except for larger Penta D and Penta E alleles, allele frequencies and genotypes defined by HipSTR from the 1000 Genomes Project phase 3 data and offered as an open-access database are consistent and highly reliable.
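
The heterozygote deficiency mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline; it only compares observed heterozygosity with the Hardy-Weinberg expectation at one locus. The function name and the toy genotypes are hypothetical, and the toy data are built so that heterozygotes are under-represented, as would happen if a long allele were missed and a heterozygote were called as a homozygote.

```python
from collections import Counter


def heterozygosity_summary(genotypes):
    """Summarise one STR locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid sample.
    Returns allele frequencies, observed heterozygosity, expected
    heterozygosity under Hardy-Weinberg, and the inbreeding-style
    coefficient F = 1 - Ho/He (positive F means heterozygote deficiency).
    """
    n_samples = len(genotypes)
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    total_alleles = 2 * n_samples
    freqs = {a: c / total_alleles for a, c in allele_counts.items()}

    observed_het = sum(1 for a, b in genotypes if a != b) / n_samples
    expected_het = 1.0 - sum(p * p for p in freqs.values())
    f_coeff = 1.0 - observed_het / expected_het if expected_het > 0 else 0.0
    return freqs, observed_het, expected_het, f_coeff


# Hypothetical toy genotypes (repeat counts), not taken from the database.
toy_genotypes = [(5, 5), (5, 5), (7, 7), (7, 7), (5, 7), (12, 12)]
freqs, ho, he, f_coeff = heterozygosity_summary(toy_genotypes)
print(f"Ho={ho:.3f} He={he:.3f} F={f_coeff:.3f}")
```

A formal Hardy-Weinberg test would add an exact or chi-square test on top of these quantities; the sketch stops at the descriptive comparison.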

Major sex differences in allele frequencies for X chromosome variants in the 1000 Genomes Project data

10.1101/2021.10.27.466015 ◽

2021 ◽

Author(s):

Zhong Wang ◽

Lei Sun ◽

Andrew D Paterson

Keyword(s):

Sex Differences ◽

X Chromosome ◽

Association Studies ◽

Allele Frequencies ◽

Whole Genome Sequence ◽

P Value ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

1000 Genomes

An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data were identified with significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence, and it was consistent between the five super-populations. Among the 245,825 common biallelic SNPs in phase 3 data presumed to be high quality, 2,039 have genome-wide significant sdMAF (p-value <5e-8). sdMAF varied by location: non-pseudo-autosomal region (NPR)=0.83%, pseudo-autosomal region (PAR1)=0.29%, PAR2=13.1%, and PAR3=0.85% of SNPs had sdMAF, and they were clustered at the NPR-PAR boundaries, among others. sdMAF at the NPR-PAR boundaries is biologically expected due to sex-linkage, but has generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Future X chromosome analyses need to take sdMAF into account.
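
As a rough illustration of what a sex difference in allele frequency means at the genome-wide threshold quoted above, the sketch below applies a plain two-proportion z-test to allele counts. This is an assumed simplification, not the test used by the authors. The function name and all counts are hypothetical; for an NPR SNP, males are treated as contributing one allele each and females two.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # threshold quoted in the abstract


def sdmaf_z_test(alt_female, total_female, alt_male, total_male):
    """Two-sided two-proportion z-test on allele counts.

    total_* is the number of alleles observed in that sex (2 per female;
    1 per male in the NPR, 2 per male in a PAR).
    Returns (frequency difference, z statistic, p-value).
    """
    p_f = alt_female / total_female
    p_m = alt_male / total_male
    pooled = (alt_female + alt_male) / (total_female + total_male)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_female + 1 / total_male))
    if se == 0:
        return p_f - p_m, 0.0, 1.0
    z = (p_f - p_m) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_f - p_m, z, p_value


# Hypothetical NPR SNP: 1000 females (2000 alleles), 1000 males (1000 alleles).
diff, z, p_value = sdmaf_z_test(600, 2000, 500, 1000)
print(f"diff={diff:+.3f} z={z:.2f} significant={p_value < GENOME_WIDE_ALPHA}")
```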

High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

10.1101/2021.02.06.430068 ◽

2021 ◽

Cited By ~ 4

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

Entire Cohort ◽

1000 Genomes ◽

Low Coverage

The 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. The final (phase 3) 2015 release of 1kGP included 2,504 unrelated samples from 26 populations, representing five continental regions of the world and was based on a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK’s HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. Using this strategy, we defined over 111 million SNVs, 14 million INDELs, and ∼170 thousand SVs across the entire cohort of 3,202 samples with estimated false discovery rate (FDR) of 0.3%, 1.0%, and 1.8%, respectively. By comparison to the low-coverage phase 3 callset, we observed substantial improvements in variant discovery and estimated FDR that were facilitated by high coverage re-sequencing and expansion of the cohort. Specifically, we called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy and we demonstrate improvements that the high coverage panel brings especially for INDEL imputation.
We make all the data generated as part of this project publicly available and we envision this updated version of the 1kGP callset to become the new de facto public resource for the worldwide scientific community working on genomics and genetics.

The Epidemiology and Genetics of Hyperuricemia and Gout across Major Racial Groups: A Literature Review and Population Genetics Secondary Database Analysis

Journal of Personalized Medicine ◽

10.3390/jpm11030231 ◽

2021 ◽

Vol 11 (3) ◽

pp. 231

Author(s):

Faven Butler ◽

Ali Alghubayshi ◽

Youssef Roman

Keyword(s):

Literature Review ◽

Risk Allele ◽

Statistical Significance ◽

Elevated Serum ◽

The United States ◽

Allele Frequencies ◽

Racial Groups ◽

1000 Genomes Project ◽

1000 Genomes ◽

Risk Alleles

Gout is an inflammatory condition caused by elevated serum urate (SU), a condition known as hyperuricemia (HU). Genetic variations, including single nucleotide polymorphisms (SNPs), can alter the function of urate transporters, leading to differential HU and gout prevalence across different populations. In the United States (U.S.), gout prevalence differentially affects certain racial groups. The objective of this proposed analysis is to compare the frequency of urate-related genetic risk alleles between Europeans (EUR) and the following major racial groups: Africans in Southwest U.S. (ASW), Han-Chinese (CHS), Japanese (JPT), and Mexican (MXL) from the 1000 Genomes Project. The Ensembl genome browser of the 1000 Genomes Project was used to conduct cross-population allele frequency comparisons of 11 SNPs across 11 genes, physiologically involved and significantly associated with SU levels and gout risk. Gene/SNP pairs included: ABCG2 (rs2231142), SLC2A9 (rs734553), SLC17A1 (rs1183201), SLC16A9 (rs1171614), GCKR (rs1260326), SLC22A11 (rs2078267), SLC22A12 (rs505802), INHBC (rs3741414), RREB1 (rs675209), PDZK1 (rs12129861), and NRXN2 (rs478607). Allele frequencies were compared to EUR using Chi-Square or Fisher’s Exact test, when appropriate. Bonferroni correction for multiple comparisons was used, with p < 0.0045 for statistical significance. Risk alleles were defined as the allele that is associated with baseline or higher HU and gout risks. The cumulative HU or gout risk allele index of the 11 SNPs was estimated for each population. The prevalence of HU and gout in U.S. and non-U.S. populations was evaluated using published epidemiological data and literature review. Compared with EUR, the SNP frequencies of 7/11 in ASW, 9/11 in MXL, 9/11 in JPT, and 11/11 in CHS were significantly different. HU or gout risk allele indices were 5, 6, 9, and 11 in ASW, MXL, CHS, and JPT, respectively. Out of the 11 SNPs, the percentage of risk alleles in CHS and JPT was 100%.
Compared to non-U.S. populations, the prevalence of HU and gout appears to be higher in Western countries. Compared with EUR, CHS and JPT populations had the highest HU or gout risk allele frequencies, followed by MXL and ASW. These results suggest that individuals of Asian descent are at higher HU and gout risk, which may partly explain the nearly three-fold higher gout prevalence among Asians versus Caucasians in ambulatory care settings. Furthermore, gout remains a disease of developed countries with a marked global rise.
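
The significance threshold quoted in this abstract is consistent with a Bonferroni correction of 0.05 over 11 SNPs (0.05 / 11 ≈ 0.0045). The sketch below shows that arithmetic together with a 2x2 chi-square comparison of allele counts between two populations. The allele counts are hypothetical, the function names are illustrative, and Fisher's exact test (which the abstract mentions for cases where it is appropriate) is not implemented here.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
    [[a, b], [c, d]], e.g. risk/other allele counts in two populations.
    Returns (chi-square statistic, p-value)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


threshold = bonferroni_threshold(0.05, 11)

# Hypothetical allele counts: reference population 100 risk / 900 other,
# comparison population 60 risk / 140 other.
chi2, p_value = chi_square_2x2(100, 900, 60, 140)
print(f"threshold={threshold:.4f} chi2={chi2:.2f} significant={p_value < threshold}")
```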

Ancestral Spectrum Analysis With Population-Specific Variants

Frontiers in Genetics ◽

10.3389/fgene.2021.724638 ◽

2021 ◽

Vol 12 ◽

Author(s):

Gang Shi ◽

Qingmin Kuang

Keyword(s):

Nucleotide Polymorphisms ◽

Sequencing Data ◽

1000 Genomes Project ◽

Specific Population ◽

High Coverage ◽

Single Nucleotide ◽

Target Populations ◽

1000 Genomes ◽

Sequencing Studies ◽

Best Linear Unbiased

With the advance of sequencing technology, an increasing number of populations have been sequenced to study the histories of worldwide populations, including their divergence, admixtures, migration, and effective sizes. The variants detected in sequencing studies are largely rare and mostly population specific. Population-specific variants are often recent mutations and are informative for revealing substructures and admixtures in populations; however, computational methods and tools to analyze them are still lacking. In this work, we propose using reference populations and single nucleotide polymorphisms (SNPs) specific to the reference populations. Ancestral information, the best linear unbiased estimator (BLUE) of the ancestral proportion, is proposed, which can be used to infer ancestral proportions in recently admixed target populations and measure the extent to which reference populations serve as good proxies for the admixing sources. Based on the same panel of SNPs, the ancestral information is comparable across samples from different studies and is not affected by genetic outliers, related samples, or the sample sizes of the admixed target populations. In addition, ancestral spectrum is useful for detecting genetic outliers or exploring co-ancestry between study samples and the reference populations. The methods are implemented in a program, Ancestral Spectrum Analyzer (ASA), and are applied in analyzing high-coverage sequencing data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP). In the analyses of American populations from the 1000 Genomes Project, we demonstrate that recent admixtures can be dissected from ancient admixtures by comparing ancestral spectra with and without indigenous Americans being included in the reference populations.

Efficient analysis of large datasets and sex bias with ADMIXTURE

10.1101/039347 ◽

2016 ◽

Cited By ~ 1

Author(s):

Suyash S. Shringarpure ◽

Carlos D. Bustamante ◽

Kenneth L. Lange ◽

David H. Alexander

Keyword(s):

Large Datasets ◽

Allele Frequencies ◽

Sex Bias ◽

1000 Genomes Project ◽

1000 Genomes ◽

Males And Females ◽

Related Individuals ◽

Using Data ◽

Human Ancestry ◽

Individual Ancestry

Background: A number of large genomic datasets are being generated for studies of human ancestry and diseases. The ADMIXTURE program is commonly used to infer individual ancestry from genomic data. Results: We describe two improvements to the ADMIXTURE software. The first enables ADMIXTURE to infer ancestry for a new set of individuals using cluster allele frequencies from a reference set of individuals. Using data from the 1000 Genomes Project, we show that this allows ADMIXTURE to infer ancestry for 10,920 individuals in a few hours (a 5x speedup). This mode also allows ADMIXTURE to correctly estimate individual ancestry and allele frequencies from a set of related individuals. The second modification allows ADMIXTURE to correctly handle X-chromosome (and other haploid) data from both males and females. We demonstrate increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes project using this extension. Conclusions: These modifications make ADMIXTURE more efficient and versatile, allowing users to extract more information from large genomic datasets.

Inference of recent admixture using genotype data

10.1101/2020.09.16.300640 ◽

2020 ◽

Author(s):

Peter Pfaffelhuber ◽

Elisabeth Sester-Huss ◽

Franz Baumdicker ◽

Jana Naue ◽

Sabine Lutz-Bonengel ◽

...

Keyword(s):

State Of The Art ◽

Forensic Genetics ◽

Statistical Test ◽

Genotype Data ◽

1000 Genomes Project ◽

Additional Information ◽

1000 Genomes ◽

Project Data ◽

Ancestry Proportions ◽

Individual Ancestry

The inference of biogeographic ancestry (BGA) has become a focus of forensic genetics. Mis-inference of BGA can have profound unwanted consequences for investigations and society. We show that recent admixture can lead to misclassification and erroneous inference of ancestry proportions, using state of the art analysis tools with (i) simulations, (ii) 1000 genomes project data, and (iii) two individuals analyzed using the ForenSeq DNA Signature Prep Kit. Subsequently, we extend existing tools for estimation of individual ancestry (IA) by allowing for different IA in both parents, leading to estimates of parental individual ancestry (PIA), and a statistical test for recent admixture. Estimation of PIA outperforms IA in most scenarios of recent admixture. Furthermore, additional information about parental ancestry can be acquired with PIA that may guide casework.

Detecting shared independent selection

10.1101/2020.04.21.053959 ◽

2020 ◽

Author(s):

Nathan S. Harris ◽

Alan R. Rogers

Keyword(s):

1000 Genomes Project ◽

Phase 3 ◽

1000 Genomes ◽

Different Populations ◽

Genomic Regions ◽

Two Populations

Signals of selection are not often shared between populations. When a mutual signal is detected, it is often not known if selection occurred before or after populations split. Here we develop a method to detect genomic regions at which selection has favored different haplotypes in two populations. This method is verified through simulations and tested on small regions of the genome. This method was then expanded to scan the phase 3 genomes of the 1000 Genomes Project populations for regions in which the evidence for independent selection is strongest. We identify several genes which likely underwent selection independently in different populations.

The International Genome Sample Resource (IGSR) collection of open human genomic variation resources

Nucleic Acids Research ◽

10.1093/nar/gkz836 ◽

2019 ◽

Vol 48 (D1) ◽

pp. D941-D947 ◽

Cited By ~ 20

Author(s):

Susan Fairley ◽

Ernesto Lowy-Gallego ◽

Emily Perry ◽

Paul Flicek

Keyword(s):

Sequence Data ◽

Genomic Variation ◽

1000 Genomes Project ◽

High Coverage ◽

Web Based ◽

1000 Genomes ◽

Open Consent ◽

Unified View ◽

Human Genomic ◽

Project Data

To sustain and develop the largest fully open human genomic resources, the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org) was established. It is built on the foundation of the 1000 Genomes Project, which created the largest openly accessible catalogue of human genomic variation developed from samples spanning five continents. IGSR (i) maintains access to 1000 Genomes Project resources, (ii) updates 1000 Genomes Project resources to the GRCh38 human reference assembly, (iii) adds new data generated on 1000 Genomes Project cell lines, (iv) shares data from samples with a similarly open consent to increase the number of samples and populations represented in the resources and (v) provides support to users of these resources. Among recent updates are the release of variation calls from 1000 Genomes Project data calculated directly on GRCh38 and the addition of high coverage sequence data for the 2504 samples in the 1000 Genomes Project phase three panel. The data portal, which facilitates web-based exploration of the IGSR resources, has been updated to include samples which were not part of the 1000 Genomes Project and now presents a unified view of data and samples across almost 5000 samples from multiple studies. All data is fully open and publicly accessible.

Copy numbers of 45S and 5S ribosomal DNA arrays lack meaningful correlation in humans

10.1101/2020.07.06.189753 ◽

2020 ◽

Author(s):

Ashley N. Hall ◽

Tychele N. Turner ◽

Christine Queitsch

Keyword(s):

Copy Number Variation ◽

Ribosomal Dna ◽

Copy Number ◽

1000 Genomes Project ◽

High Coverage ◽

1000 Genomes ◽

Copy Numbers ◽

Number Variation ◽

Rdna Copy ◽

Rdna Copy Number

The ribosomal DNA genes are tandemly arrayed in most eukaryotes and exhibit vast copy number variation. There is growing interest in integrating this variation into genotype-phenotype associations. Here, we explored a possible association of rDNA copy number variation with autism spectrum disorder and found no difference between probands and unaffected siblings. However, rDNA copy number estimates from whole genome sequencing are error-prone, so we sought to use pulsed-field gel electrophoresis, a classic gold-standard method, to validate rDNA copy number genotypes. The electrophoresis approach is not readily applicable to the human 45S arrays due to their size and location on five separate chromosomes; however, it should accurately resolve copy numbers for the shorter 5S arrays that reside on a single chromosome. Previous studies reported tightly correlated, concerted copy number variation between the 45S and 5S arrays, which should enable the validation of 45S copy number estimates with CHEF-gel-verified 5S copy numbers. Here, we show that the previously reported strong concerted copy number variation is likely an artifact of variable data quality in the earlier published 1000 Genomes Project sequences. We failed to detect a meaningful correlation between 45S and 5S copy numbers in the large, high-coverage Simons Simplex Collection dataset as well as in the recent high-coverage 1000 Genomes Project sequences. Our findings illustrate the challenge of genotyping repetitive DNA regions accurately and call into question the accuracy of recently published studies of rDNA copy number variation in cancers and aging that relied on diverse publicly available resources for sequence data.

Fine Mapping Analysis of the MHC Region to Identify Variants Associated With Chinese Vitiligo and SLE and Association Across These Diseases

Frontiers in Immunology ◽

10.3389/fimmu.2021.758652 ◽

2022 ◽

Vol 12 ◽

Author(s):

Lu Cao ◽

Ruixue Zhang ◽

Yirui Wang ◽

Xia Hu ◽

Liang Yong ◽

...

Keyword(s):

Autoimmune Diseases ◽

Fine Mapping ◽

Epidemiological Studies ◽

1000 Genomes Project ◽

Phase 3 ◽

1000 Genomes ◽

Mapping Analysis ◽

Healthy Control ◽

New Perspective

The important role of MHC in the pathogenesis of vitiligo and SLE has been confirmed in various populations. To map the most significant MHC variants associated with the risk of vitiligo and SLE, we conducted fine mapping analysis using 1117 vitiligo cases, 1046 SLE cases and 1693 healthy control subjects in the Han-MHC reference panel and 1000 Genomes Project phase 3. rs113465897 (P=1.03×10⁻¹³, OR=1.64, 95% CI=1.44–1.87) and rs3129898 (P=4.21×10⁻¹⁷, OR=1.93, 95% CI=1.66–2.25) were identified as being most strongly associated with vitiligo and SLE, respectively. Stepwise conditional analysis revealed additional independent signals at rs3130969 (P=1.48×10⁻⁷, OR=0.69, 95% CI=0.60–0.79) and HLA-DPB1*03:01 (P=1.07×10⁻⁶, OR=1.94, 95% CI=1.49–2.53) for vitiligo and at HLA-DQB1*03:01 (P=4.53×10⁻⁷, OR=0.62, 95% CI=0.52–0.75) for SLE. Considering that epidemiological studies have confirmed comorbidities of vitiligo and SLE, we used the GCTA tool to analyse the genetic correlation between these two diseases in the HLA region; the correlation coefficient was 0.79 (P=5.99×10⁻¹⁰, SE=0.07), confirming their similar genetic backgrounds. Our findings highlight the value of the MHC region in vitiligo and SLE and provide a new perspective for comorbidities among autoimmune diseases.
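
For readers unfamiliar with how an odds ratio and its 95% confidence interval relate, the sketch below computes both from a 2x2 table with the standard Wald (log-scale) interval. The counts are hypothetical and the function name is illustrative; the study itself presumably derived its estimates from regression models, so this is only an assumed, simplified illustration of the quantities reported.

```python
import math


def odds_ratio_wald_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with a Wald confidence interval (z=1.96 gives ~95%).

    Returns (odds ratio, lower bound, upper bound).
    """
    cells = (case_carriers, case_noncarriers,
             control_carriers, control_noncarriers)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1 / x for x in cells))
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical counts, not taken from the study.
odds_ratio, ci_low, ci_high = odds_ratio_wald_ci(30, 70, 20, 80)
print(f"OR={odds_ratio:.2f} 95% CI={ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1, as in the associations reported above, corresponds to a difference in carrier odds between cases and controls at the chosen confidence level.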