Testing for Hardy-Weinberg Equilibrium in Structured Populations using NGS Data

2018
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen

Abstract
Testing for Hardy-Weinberg Equilibrium (HWE) is a common practice for quality control in genetic studies. Variable sites violating HWE may be identified as technical errors in the sequencing or genotyping process, or they may be of special evolutionary interest. Large-scale genetic studies based on next-generation sequencing (NGS) methods have become more prevalent as costs decrease, but these methods are still associated with statistical uncertainty. Such large-scale studies usually include samples from diverse ancestries, making some degree of population structure almost inevitable. Precautions are therefore needed when analyzing these datasets, as population structure causes deviations from HWE. Here we propose a method that takes population structure into account when testing for HWE, so that other factors causing deviations from HWE can be detected. We show the effectiveness of our method on NGS data as well as on genotype data, for both simulated and real datasets, where the use of genotype likelihoods enables us to model the uncertainty in low-depth sequencing data.
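The core of such a test can be pictured as a likelihood-ratio comparison between genotype-likelihood models with and without a deviation parameter. The sketch below is a minimal illustration under that simplification; it does not condition on population structure, and all function names are hypothetical, not the authors' implementation:

```python
# Minimal sketch of a likelihood-ratio HWE test from genotype
# likelihoods, assuming a single unstructured population (the
# method above additionally conditions on population structure).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def site_loglik(gl, p, F):
    """gl: (N, 3) genotype likelihoods P(reads | genotype) for one
    site. Genotype frequencies follow allele frequency p and an
    inbreeding-style deviation parameter F."""
    freqs = np.array([(1 - p) ** 2 + F * p * (1 - p),
                      2 * p * (1 - p) * (1 - F),
                      p ** 2 + F * p * (1 - p)])
    return np.log(gl @ freqs).sum()

def hwe_lrt_pvalue(gl):
    """1-df LRT: the null fixes F = 0; the alternative profiles F."""
    p = minimize_scalar(lambda q: -site_loglik(gl, q, 0.0),
                        bounds=(1e-6, 1 - 1e-6), method="bounded").x
    l0 = site_loglik(gl, p, 0.0)
    f_min = max(-p / (1 - p), -(1 - p) / p)  # keep frequencies >= 0
    l1 = -minimize_scalar(lambda F: -site_loglik(gl, p, F),
                          bounds=(f_min + 1e-6, 1 - 1e-6),
                          method="bounded").fun
    return chi2.sf(max(2 * (l1 - l0), 0.0), df=1)
```

Profiling F at the allele frequency estimated under the null is a shortcut kept for brevity; a fuller treatment would maximize p and F jointly.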

2017
Author(s):  
Wei Hao ◽  
John D. Storey

Abstract
Testing for Hardy-Weinberg equilibrium (HWE) is an important component in almost all analyses of population genetic data. Genetic markers that violate HWE are often treated as special cases; for example, they may be flagged as possible genotyping errors or they may be investigated more closely for evolutionary signatures of interest. The presence of population structure is one reason why genetic markers may fail a test of HWE. This is problematic because almost all natural populations studied in the modern setting show some degree of structure. Therefore, it is important to be able to detect deviations from HWE for reasons other than structure. To this end, we extend statistical tests of HWE to allow for population structure, which we call a test of “structural HWE” (sHWE). Additionally, our new test allows one to automatically choose tuning parameters and identify accurate models of structure. We demonstrate our approach on several important studies, provide theoretical justification for the test, and present empirical evidence for its utility. We anticipate the proposed test will be useful in a broad range of analyses of genome-wide population genetic data.
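One way to picture a structure-aware HWE test is to give each individual its own allele frequency (for example from a low-rank factor model of the genotype matrix) and compare observed genotype counts with the individual-level HWE expectation. A minimal sketch assuming such frequencies are already available; the published sHWE test calibrates its null distribution differently, so this is only the skeleton of the comparison:

```python
# Hedged sketch: each individual i gets its own allele frequency
# pi[i], and observed genotype counts at one SNP are compared with
# the individual-level HWE expectation.
import numpy as np
from scipy.stats import chi2

def structured_hwe_pvalue(genos, pi):
    """genos: (N,) array of 0/1/2 genotypes at one SNP.
    pi: (N,) array of individual-specific allele frequencies."""
    probs = np.stack([(1 - pi) ** 2,       # P(genotype 0 | pi)
                      2 * pi * (1 - pi),   # P(genotype 1 | pi)
                      pi ** 2])            # P(genotype 2 | pi)
    expected = probs.sum(axis=1)           # expected genotype counts
    observed = np.bincount(genos, minlength=3).astype(float)
    stat = ((observed - expected) ** 2 / expected).sum()
    return chi2.sf(stat, df=1)             # classical 1-df reference
```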


2017, Vol 2, pp. 35
Author(s):  
Shazia Mahamdallie ◽  
Elise Ruark ◽  
Shawn Yost ◽  
Emma Ramsay ◽  
Imran Uddin ◽  
...  

Detection of deletions and duplications of whole exons (exon CNVs) is a key requirement of genetic testing. Accurate detection of this variant type has proved very challenging in targeted next-generation sequencing (NGS) data, particularly if only a single exon is involved. Many different NGS exon CNV calling methods have been developed over the last five years. Such methods are usually evaluated using simulated and/or in-house data due to a lack of publicly available datasets with orthogonally generated results. This hinders tool comparisons, transparency and reproducibility. To provide a community resource for assessment of exon CNV calling methods in targeted NGS data, we here present the ICR96 exon CNV validation series. The dataset includes high-quality sequencing data from a targeted NGS assay (the TruSight Cancer Panel) together with Multiplex Ligation-dependent Probe Amplification (MLPA) results for 96 independent samples. 66 samples contain at least one validated exon CNV and 30 samples have validated negative results for exon CNVs in 26 genes. The dataset includes 46 exon CNVs in BRCA1, BRCA2, TP53, MLH1, MSH2, MSH6, PMS2, EPCAM or PTEN, giving excellent representation of the cancer predisposition genes most frequently tested in clinical practice. Moreover, the validated exon CNVs include 25 single exon CNVs, the most difficult type of exon CNV to detect. The FASTQ files for the ICR96 exon CNV validation series can be accessed through the European Genome-phenome Archive (EGA) under the accession number EGAS00001002428.
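A consumer of the dataset would typically score a CNV caller against the MLPA truth set per sample. A hypothetical sketch of that tabulation, with an invented data layout (the released resource itself provides FASTQs plus MLPA results, not this format):

```python
# Hypothetical sketch of scoring an exon CNV caller against the
# ICR96 truth set: per sample, compare the called CNVs with the
# MLPA-validated ones.
def score_caller(truth, calls):
    """truth, calls: dict mapping sample id -> set of CNV
    identifiers, e.g. ("BRCA1", "exon13", "dup")."""
    tp = fn = fp = 0
    for sample, true_cnvs in truth.items():
        called = calls.get(sample, set())
        tp += len(true_cnvs & called)      # validated CNVs recovered
        fn += len(true_cnvs - called)      # validated CNVs missed
        fp += len(called - true_cnvs)      # calls with no MLPA support
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return sensitivity, fp
```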


2019
Author(s):  
Emil Jørsboe ◽  
Anders Albrechtsen

Abstract
Introduction
Association studies using genetic data from SNP-chip based imputation or low-depth sequencing provide a cost-efficient design for large-scale studies. However, these approaches yield genetic data with uncertainty in the observed genotypes. Here we explore association methods that can be applied to data where the genotype is not directly observed. We investigate how using different priors when estimating genotype probabilities affects the association results in different scenarios, such as studies with population structure and varying sequencing depth. We also present a method (ANGSD-asso) that is computationally feasible for analysing large-scale low-depth sequencing data sets, such as those generated by non-invasive prenatal testing (NIPT) with low-pass sequencing.

Methods
ANGSD-asso's EM model treats the unobserved genotype as a latent variable in a generalised linear model framework. The software is implemented in C/C++ and can be run multi-threaded, enabling the analysis of big data sets. ANGSD-asso is based on genotype probabilities, which can be estimated in various ways, for example using the sample allele frequency as a prior, using the individual allele frequencies as a prior, or using haplotype frequencies from haplotype imputation. Using simulations of sequencing data, we explore how genotype-probability-based methods compare to using genetic dosages in large association studies with genotype uncertainty.

Results & Discussion
Our simulations show that in a structured population the individual allele frequency prior has better power than the sample allele frequency prior. If there is a correlation between genotype uncertainty and phenotype, the individual allele frequency prior also helps control the false positive rate. In the absence of population structure, the two priors perform similarly. In scenarios where sequencing depth and phenotype are correlated, ANGSD-asso's EM model has better statistical power and less bias than using dosages. Lastly, when adding additional covariates to the linear model, ANGSD-asso's EM model has more statistical power and provides less biased effect size estimates than other methods that accommodate genotype uncertainty, while also being much faster. This makes it possible to properly account for genotype uncertainty in large-scale association studies.
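The latent-genotype EM idea can be sketched for the simplest case, a linear model with an intercept and a genotype effect. This toy version illustrates the alternation between posterior genotype weights and weighted least squares; it is not the ANGSD-asso implementation, which is written in C/C++ and also handles logistic models and additional covariates:

```python
# Toy sketch of the latent-genotype EM described above, for a
# linear model y ~ a + b*g with g in {0, 1, 2} unobserved and
# genotype probabilities gp as the prior.
import numpy as np
from scipy.stats import norm

def em_assoc(y, gp, iters=50):
    """y: (N,) phenotypes. gp: (N, 3) genotype probabilities.
    Returns the fitted intercept a and genotype effect b."""
    g = np.array([0.0, 1.0, 2.0])
    a, b, sd = y.mean(), 0.0, y.std()
    for _ in range(iters):
        # E-step: posterior genotype weights given the current fit
        w = gp * norm.pdf(y[:, None], loc=a + b * g, scale=sd)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted least squares on the expanded (i, g) grid
        eg, eg2 = w @ g, w @ g ** 2        # E[g_i], E[g_i^2]
        b = (((y * eg).sum() - y.sum() * eg.mean())
             / (eg2.sum() - eg.sum() * eg.mean()))
        a = y.mean() - b * eg.mean()
        sd = np.sqrt((w * (y[:, None] - a - b * g) ** 2).sum() / len(y))
    return a, b
```

Replacing the E-step weights with fixed expected dosages recovers the dosage approach the simulations compare against; the EM instead re-weights genotypes using the phenotype, which is where the extra power comes from.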


2020
Author(s):  
Charles Hadley S. King ◽  
Jonathon Keeney ◽  
Nuria Guimera ◽  
Souvik Das ◽  
Brian Fochtman ◽  
...  

Abstract
For regulatory submissions of next-generation sequencing (NGS) data, it is vital that the analysis workflow be robust, reproducible, and understandable. This project demonstrates that use of the IEEE 2791-2020 standard (BioCompute Objects, BCOs) enables complete and concise communication of NGS data analysis results. One arm of a clinical trial was replicated using synthetically generated data made to resemble real biological data. Two separate, independent analyses were then carried out using BCOs as the vehicle for communicating the analysis: one simulating a pharmaceutical regulatory submission to the FDA, and another simulating the FDA review. The two sets of results were compared and tabulated for concordance: of the 118 simulated patient samples generated, the final results of 117 (99.15%) were in agreement. This high concordance rate demonstrates the ability of a BCO, when a verification kit is included, to effectively capture and clearly communicate NGS analyses within regulatory submissions. BCOs promote transparency and reproducibility, thereby reinforcing trust in the regulatory submission process.
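The concordance tabulation itself is simple to reproduce: each analysis yields one final result per sample, and agreement is tallied over the shared samples. A toy sketch with an invented data layout:

```python
# Toy sketch of the concordance tabulation: two independent
# analyses (a simulated submission and a simulated review) each
# yield one final result per sample.
def concordance(submission, review):
    """submission, review: dict mapping sample id -> final call."""
    shared = submission.keys() & review.keys()
    agree = sum(submission[s] == review[s] for s in shared)
    return agree, len(shared), agree / len(shared)

# e.g. 117 agreeing results out of 118 samples gives ~0.9915
```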


2017
Author(s):  
Xin Zhou ◽  
Serafim Batzoglou ◽  
Arend Sidow ◽  
Lu Zhang

Abstract
Background
De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next-generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls.

Results
To address this challenge, we developed HAPDeNovo, a program that leverages phasing information from linked-read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes, because they are almost certainly false positives. Our experiments on 10X Chromium linked-read sequencing trio data reveal that HAPDeNovo eliminates 80% to 99% of false positives regardless of how large the candidate DNM set is.

Conclusions
HAPDeNovo leverages haplotype information from linked-read sequencing to remove spurious false positive DNMs effectively, increasing the accuracy of DNM detection dramatically without sacrificing sensitivity.
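The filtering step can be sketched as two small decisions: call a haploid genotype from the reads assigned to each haplotype, then discard any candidate whose evidence within a single haplotype is internally mixed. Thresholds and names below are illustrative, not HAPDeNovo's defaults:

```python
# Hedged sketch of the HAPDeNovo filtering idea: a haploid
# genotype is called per haplotype, and a candidate DNM is kept
# only if neither haplotype looks heterozygous on its own.
def haploid_call(alleles, min_frac=0.9, min_reads=3):
    """alleles: 'ref'/'alt' observations from reads assigned to ONE
    haplotype. Returns 'ref', 'alt', 'het', or None (too few reads)."""
    if len(alleles) < min_reads:
        return None
    alt_frac = alleles.count("alt") / len(alleles)
    if alt_frac >= min_frac:
        return "alt"
    if alt_frac <= 1 - min_frac:
        return "ref"
    return "het"                     # mixed alleles on one haplotype

def keep_candidate(hap1_alleles, hap2_alleles):
    """A candidate that is heterozygous within a single haplotype
    is almost certainly a false positive; drop it."""
    calls = (haploid_call(hap1_alleles), haploid_call(hap2_alleles))
    return "het" not in calls
```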


2015, Vol 14s5, pp. CIN.S30793
Author(s):  
Jian Li ◽  
Aarif Mohamed Nazeer Batcha ◽  
Björn Gaining ◽  
Ulrich R. Mansmann

Next-generation sequencing (NGS) technologies, which have advanced rapidly in the past few years, possess the potential to classify diseases, decipher the molecular code of related cell processes, identify targets for decision-making on targeted therapy or prevention strategies, and predict clinical treatment response. NGS is thus on its way to revolutionizing oncology. With the help of NGS, we can draw a finer map of the genetic basis of diseases and improve our understanding of diagnostic and prognostic applications and therapeutic methods. Despite these advantages and its potential, NGS faces several critical challenges, including reduction of sequencing cost, enhancement of sequencing quality, improvement of technical simplicity and reliability, and development of semiautomated and integrated analysis workflows. To address these challenges, we conducted a literature review and summarize a four-stage NGS workflow, providing a systematic overview of NGS-based analysis, explaining the strengths and weaknesses of diverse NGS-based software tools, and elucidating their potential connection to individualized medicine. By presenting this four-stage workflow, we aim to provide the minimal structural layout required for NGS data storage and reproducibility.


2015, Vol 63 (4), pp. 275
Author(s):  
Andrea Bertram ◽  
P. Joana Dias ◽  
Sherralee Lukehurst ◽  
W. Jason Kennington ◽  
David Fairclough ◽  
...  

Bight redfish, Centroberyx gerrardi, is a demersal teleost endemic to continental shelf and upper slope waters of southern Australia. Throughout most of its range, C. gerrardi is targeted by a number of separately managed commercial and recreational fisheries across several jurisdictions. However, it is currently unknown whether stock assessments and management for this shared resource are being conducted at appropriate spatial scales, which requires knowledge of population structure and connectivity. To investigate population structure and connectivity, we developed 16 new polymorphic microsatellite markers using 454 shotgun sequencing. Two to 15 alleles per locus were detected. There was no evidence of linkage disequilibrium between pairs of loci, and all loci except one were in Hardy–Weinberg equilibrium. Cross-amplification trials in the congeneric C. australis and C. lineatus revealed that 11 and 16 loci, respectively, are potentially useful. However, deviations from Hardy–Weinberg equilibrium and linkage disequilibrium between pairs of loci were detected at several of these markers for C. australis, and therefore the number of markers useful for population genetic analyses in that species is likely considerably lower than 11.


2019
Author(s):  
Xinzhu Wei ◽  
Rasmus Nielsen

Abstract
Previous analyses of the UK Biobank (UKB) genotyping array data at the CCR5-Δ32 locus show evidence for deviations from Hardy-Weinberg Equilibrium (HWE) and an increased mortality rate of homozygous individuals, consistent with a recessive deleterious effect of the deletion mutation. We here examine whether similar deviations from HWE can be observed in the newly released UKB Whole Exome Sequencing (WES) data and in the sequencing data of the Genome Aggregation Database (gnomAD). We also examine the reliability of the genotype calls in the UKB array data. The UKB genotyping array probe targeting CCR5-Δ32 (rs62625034) and the WES calls of Δ32 are strongly correlated (r2 = 0.97). This contrasts with tag SNPs of CCR5-Δ32 in the UKB, which have high missing data rates and imputation error rates. We also show that, while different data sets are subject to different biases, both the UKB-WES and the gnomAD data have a deficiency of homozygous CCR5-Δ32 individuals compared to the HWE expectation (combined P-value < 0.01), consistent with an increased mortality rate in homozygotes. Finally, we perform a survival analysis on data from parents of UKB volunteers, which, while underpowered, is also consistent with the original report of a deleterious effect of CCR5-Δ32 in the homozygous state.
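The homozygote-deficiency comparison amounts to contrasting the observed count of Δ32/Δ32 individuals with the HWE expectation nq². A back-of-the-envelope sketch using a one-sided binomial test; it ignores population structure and the bias corrections discussed above, and the example numbers are placeholders, not values from the paper:

```python
# Back-of-the-envelope sketch: compare the observed number of
# Delta32/Delta32 homozygotes with the HWE expectation n * q^2.
from scipy.stats import binomtest

def homozygote_deficit_pvalue(n_samples, n_hom, q):
    """q: deletion allele frequency. Tests for FEWER homozygotes
    than the Hardy-Weinberg expectation n_samples * q**2."""
    return binomtest(n_hom, n_samples, q ** 2,
                     alternative="less").pvalue

# placeholder usage: homozygote_deficit_pvalue(100_000, 950, 0.10)
# (expected homozygotes under HWE would be 100_000 * 0.10**2 = 1000)
```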

