Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data

2014 ◽  
Author(s):  
Debora Yoshihara Caldeira Brandt ◽  
Vitor Rezende da Costa Aguiar ◽  
Bárbara Domingues Bitarello ◽  
Kelly Nunes ◽  
Jérôme Goudet ◽  
...  

Next Generation Sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the Human Leukocyte Antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the SNPs reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1,092 1000G samples, and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect, and that allele frequencies are estimated with an error higher than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias towards overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates, and discuss the outcomes of including those sites in different kinds of analyses. Since the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
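The benchmarking idea in this abstract (comparing NGS genotype calls against a gold standard and flagging sites whose reference allele frequency is off by more than 0.1) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the genotype coding (number of reference alleles per individual) and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: each genotype is coded as the number of reference alleles
# carried by a diploid individual (0, 1, or 2).
def ref_allele_frequency(genotypes: List[int]) -> float:
    """Reference allele frequency from diploid genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))

def benchmark_sites(ngs: Dict[str, List[int]],
                    gold: Dict[str, List[int]],
                    threshold: float = 0.1) -> Dict[str, dict]:
    """Compare NGS calls with gold-standard calls, site by site.

    A positive freq_diff means the NGS data overestimate the reference
    allele frequency, the direction expected under mapping bias.
    """
    report = {}
    for site in ngs:
        test, truth = ngs[site], gold[site]
        mismatches = sum(t != g for t, g in zip(test, truth))
        diff = ref_allele_frequency(test) - ref_allele_frequency(truth)
        report[site] = {
            "genotype_error_rate": mismatches / len(truth),
            "freq_diff": diff,
            "flagged": abs(diff) > threshold,
        }
    return report

# Hypothetical toy data: five individuals at two sites.
ngs_calls = {"site_1": [2, 2, 2, 1, 2], "site_2": [1, 0, 2, 1, 1]}
gold_calls = {"site_1": [2, 1, 1, 1, 2], "site_2": [1, 0, 2, 1, 1]}
report = benchmark_sites(ngs_calls, gold_calls)

for site, stats in report.items():
    print(site, stats)
```

In this toy example the first site is flagged because the NGS calls carry more reference alleles than the gold standard, while the second site agrees exactly and is not flagged.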


2015 ◽  
Vol 5 (5) ◽  
pp. 931-941 ◽  
Author(s):  
Débora Y. C. Brandt ◽  
Vitor R. C. Aguiar ◽  
Bárbara D. Bitarello ◽  
Kelly Nunes ◽  
Jérôme Goudet ◽  
...  


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12294
Author(s):  
Neeraj Bharti ◽  
Ruma Banerjee ◽  
Archana Achalere ◽  
Sunitha Manjari Kasibhatla ◽  
Rajendra Joshi

Objectives Reliable identification of population-specific variants is important for building the single nucleotide polymorphism (SNP) profile. In this study, genomic variation using allele frequency differences of pharmacologically important genes for Gujarati Indians in Houston (GIH) and Indian Telugu in the U.K. (ITU) from the 1000 Genomes Project vis-à-vis global population data was studied to understand its role in drug response. Methods A joint genotyping approach was used to derive variants of GIH and ITU independently. SNPs of both these populations with significant allele frequency variation (minor allele frequency ≥ 0.05) with super-populations from the 1000 Genomes Project and gnomAD based on Chi-square distribution with p-value of ≤ 0.05 and Bonferroni’s multiple adjustment tests were identified. Population stratification and fixation index analyses were carried out to understand genetic differentiation. Functional annotation of variants was carried out using SnpEff, VEP and CADD score. Results Population stratification of VIP genes revealed four clusters viz., single cluster of GIH and ITU, one cluster each of East Asian, European, African populations and Admixed American was found to be admixed. A total of 13 SNPs belonging to ten pharmacogenes were identified to have significant allele frequency variation in both GIH and ITU populations as compared to one or more super-populations. These SNPs belong to VKORC1 (rs17708472, rs2359612, rs8050894) involved in Vitamin K cycle, cytochrome P450 isoforms CYP2C9 (rs1057910), CYP2B6 (rs3211371), CYP2A2 (rs4646425) and CYP2A4 (rs4646440); ATP-binding cassette (ABC) transporter ABCB1 (rs12720067), DPYD1 (rs12119882, rs56160474) involved in pyrimidine metabolism, methyltransferase COMT (rs9332377) and transcriptional factor NR1I2 (rs6785049). SNPs rs1544410 (VDR), rs2725264 (ABCG2), rs5215 and rs5219 (KCNJ11) show a high fixation index (≥ 0.5) with either EAS or AFR populations. Missense variants rs1057910 (CYP2C9), rs1801028 (DRD2) and rs1138272 (GSTP1), rs116855232 (NUDT15); intronic variants rs1131341 (NQO1) and rs115349832 (DPYD) are identified to be ‘deleterious’. Conclusions Analysis of SNPs pertaining to pharmacogenes in GIH and ITU populations using population structure, fixation index and allele frequency variation provides a premise for understanding the role of genetic diversity in drug response in Asian Indians.
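The Methods mention a chi-square test on allele frequency differences with a Bonferroni adjustment. The following Python sketch shows one plausible reading of that step, using a 2x2 table of allele counts per SNP; it is not the authors' code. The function names, SNP labels and allele counts are hypothetical, and the test omits any continuity correction.

```python
import math
from typing import Dict, Tuple

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]
    (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of a difference
    return n * (a * d - b * c) ** 2 / denom

def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def significant_snps(counts: Dict[str, Tuple[int, int, int, int]],
                     alpha: float = 0.05) -> Dict[str, Tuple[float, bool]]:
    """counts maps SNP -> (alt_pop1, ref_pop1, alt_pop2, ref_pop2).

    Bonferroni: each test is compared against alpha / number_of_tests.
    """
    corrected_alpha = alpha / len(counts)
    results = {}
    for snp, (a, b, c, d) in counts.items():
        p = p_value_1df(chi_square_2x2(a, b, c, d))
        results[snp] = (p, p <= corrected_alpha)
    return results

# Hypothetical allele counts (alternate, reference) in two populations.
toy_counts = {
    "snp_a": (60, 140, 20, 180),  # alt frequency 0.30 vs 0.10
    "snp_b": (50, 150, 48, 152),  # alt frequency 0.25 vs 0.24
}
results = significant_snps(toy_counts)

for snp, (p, is_significant) in results.items():
    print(snp, f"p={p:.3g}", "significant" if is_significant else "not significant")
```

With these toy counts only the first SNP passes the Bonferroni-corrected threshold; the second shows nearly identical frequencies in both populations.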



2020 ◽  
Author(s):  
Nathan S. Harris ◽  
Alan R. Rogers

Abstract Signals of selection are not often shared between populations. When a mutual signal is detected, it is often not known if selection occurred before or after populations split. Here we develop a method to detect genomic regions at which selection has favored different haplotypes in two populations. This method is verified through simulations and tested on small regions of the genome. This method was then expanded to scan the phase 3 genomes of the 1000 Genomes Project populations for regions in which the evidence for independent selection is strongest. We identify several genes which likely underwent selection independently in different populations.



2014 ◽  
Vol 12 (03) ◽  
pp. 1450011 ◽  
Author(s):  
Branko Borštnik ◽  
Danilo Pumpernik

We claim that the apparently enhanced CpG transversions in the form CpG to CpC/GpG or to ApG/CpT are caused by the hypermutable CpG to CpA/TpG transition. The nucleotide replacement counts obtained from the human/chimpanzee/gorilla/orangutan sequence alignments, representing the replacements due to the evolutionary species divergence, and the results of the 1000 Genomes Project, which provide the differences due to intraspecies diversification, were analyzed to estimate the ratio of CpG versus non-CpG transversion probabilities. The trinucleotide replacement counts were extracted from the regions that are free of functional constraints. The CpG transversion probabilities based upon the genomic comparisons were found to be more than twice the non-CpG transversion probabilities. The diversity data emerging from 14 population groups were partitioned into five classes as a function of the parameter quantifying the spread of the polymorphic allele among the group of individuals. The results based upon the human polymorphism exhibit a trend in which the ratio of CpG to non-CpG transversion probabilities exceeds unity by less and less as the derived allele frequency (DAF) of the SNPs decreases. A computer simulation of a simplified model indicates that the phenomenon of the apparent enhancement of CpG transversions can have its source in the interference of the entropic effects with the maximum likelihood methodologies.



2020 ◽  
Author(s):  
Julian Gruendner ◽  
Nicolas Wolf ◽  
Lars Tögel ◽  
Florian Haller ◽  
Hans-Ulrich Prokosch ◽  
...  

BACKGROUND The introduction of next-generation sequencing (NGS) into molecular cancer diagnostics has led to an increase in the data available for the identification and evaluation of driver mutations and for defining personalized cancer treatment regimens. The meaningful combination of omics data, ie, pathogenic gene variants and alterations with other patient data, to understand the full picture of malignancy has been challenging. OBJECTIVE This study describes the implementation of a system capable of processing, analyzing, and subsequently combining NGS data with other clinical patient data for analysis within and across institutions. METHODS On the basis of the already existing NGS analysis workflows for the identification of malignant gene variants at the Institute of Pathology of the University Hospital Erlangen, we defined basic requirements on an NGS processing and analysis pipeline and implemented a pipeline based on the GEMINI (GEnome MINIng) open source genetic variation database. For the purpose of validation, this pipeline was applied to data from the 1000 Genomes Project and subsequently to NGS data derived from 206 patients of a local hospital. We further integrated the pipeline into existing structures of data integration centers at the University Hospital Erlangen and combined NGS data with local nongenomic patient-derived data available in Fast Healthcare Interoperability Resources format. RESULTS Using data from the 1000 Genomes Project and from the patient cohort as input, the implemented system produced the same results as already established methodologies. Further, it satisfied all our identified requirements and was successfully integrated into the existing infrastructure. Finally, we showed in an exemplary analysis how the data could be quickly loaded into and analyzed in KETOS, a web-based analysis platform for statistical analysis and clinical decision support. 
CONCLUSIONS This study demonstrates that the GEMINI open source database can be augmented to create an NGS analysis pipeline. The pipeline generates high-quality results consistent with the already established workflows for gene variant annotation and pathological evaluation. We further demonstrate how NGS-derived genomic and other clinical data can be combined for further statistical analysis, thereby providing for data integration using standardized vocabularies and methods. Finally, we demonstrate the feasibility of the pipeline integration into hospital workflows by providing an exemplary integration into the data integration center infrastructure, which is currently being established across Germany.



2019 ◽  
Vol 35 (22) ◽  
pp. 4851-4853 ◽  
Author(s):  
Mihir A Kamat ◽  
James A Blackshaw ◽  
Robin Young ◽  
Praveen Surendran ◽  
Stephen Burgess ◽  
...  

Abstract Summary PhenoScanner is a curated database of publicly available results from large-scale genetic association studies in humans. This online tool facilitates ‘phenome scans’, where genetic variants are cross-referenced for association with many phenotypes of different types. Here we present a major update of PhenoScanner (‘PhenoScanner V2’), including over 150 million genetic variants and more than 65 billion associations (compared to 350 million associations in PhenoScanner V1) with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers. The query options have been extended to include searches by genes, genomic regions and phenotypes, as well as for genetic variants. All variants are positionally annotated using the Variant Effect Predictor and the phenotypes are mapped to Experimental Factor Ontology terms. Linkage disequilibrium statistics from the 1000 Genomes project can be used to search for phenotype associations with proxy variants. Availability and implementation PhenoScanner V2 is available at www.phenoscanner.medschl.cam.ac.uk.



2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Fadilla Wahyudi ◽  
Farhang Aghakhanian ◽  
Sadequr Rahman ◽  
Yik-Ying Teo ◽  
Michał Szpak ◽  
...  

Abstract Background In population genomics, polymorphisms that are highly differentiated between geographically separated populations are often suggestive of Darwinian positive selection. Genomic scans have highlighted several such regions in African and non-African populations, but only a handful of these have functional data that clearly associates candidate variations driving the selection process. Fine-Mapping of Adaptive Variation (FineMAV) was developed to address this in a high-throughput manner using population based whole-genome sequences generated by the 1000 Genomes Project. It pinpoints positively selected genetic variants in sequencing data by prioritizing high frequency, population-specific and functional derived alleles. Results We developed a stand-alone software that implements the FineMAV statistic. To graphically visualise the FineMAV scores, it outputs the statistics as bigWig files, which is a common file format supported by many genome browsers. It is available as a command-line and graphical user interface. The software was tested by replicating the FineMAV scores obtained using 1000 Genomes Project African, European, East and South Asian populations and subsequently applied to whole-genome sequencing datasets from Singapore and China to highlight population specific variants that can be subsequently modelled. The software tool is publicly available at https://github.com/fadilla-wahyudi/finemav. Conclusions The software tool described here determines genome-wide FineMAV scores, using low or high-coverage whole-genome sequencing datasets, that can be used to prioritize a list of population specific, highly differentiated candidate variants for in vitro or in vivo functional screens. The tool displays these scores on the human genome browsers for easy visualisation, annotation and comparison between different genomic regions in worldwide human populations.



Genes ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 44
Author(s):  
Iago Maceda ◽  
Oscar Lao

The 1000 Genomes Project (1000G) is one of the most popular whole genome sequencing datasets used in different genomics fields and has boosted our knowledge in medical and population genomics, among other fields. Recent studies have reported the presence of ghost mutation signals in the 1000G. Furthermore, studies have shown that these mutations can influence the outcomes of follow-up studies based on the genetic variation of 1000G, such as single nucleotide variant (SNV) imputation. While the overall effect of these ghost mutations can be considered negligible for common genetic variants in many populations, the potential bias remains unclear when studying low frequency genetic variants in the population. In this study, we analyze the effect of the sequencing center on predicted loss of function (LoF) alleles, the number of singletons, and the patterns of archaic introgression in the 1000G. Our results support previous studies showing that the sequencing center is associated with LoF and singletons independent of the population that is considered. Furthermore, we observed that patterns of archaic introgression were distorted for some populations depending on the sequencing center. When analyzing the frequency of SNPs showing extreme patterns of genotype differentiation among centers for CEU, YRI, CHB, and JPT, we observed that the magnitude of the sequencing batch effect was stronger at MAF < 0.2 and showed different profiles between CHB and the other populations. All these results suggest that data from 1000G must be interpreted with caution when considering statistics using variants at low frequency.



10.2196/19879 ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. e19879
Author(s):  
Julian Gruendner ◽  
Nicolas Wolf ◽  
Lars Tögel ◽  
Florian Haller ◽  
Hans-Ulrich Prokosch ◽  
...  





