Use of SNP chips to detect rare pathogenic variants: retrospective, population based diagnostic evaluation

Abstract Objective To determine whether the sensitivity and specificity of SNP chips are adequate for detecting rare pathogenic variants in a clinically unselected population. Design Retrospective, population based diagnostic evaluation. Participants 49 908 people recruited to the UK Biobank with SNP chip and next generation sequencing data, and an additional 21 people who purchased consumer genetic tests and shared their data online via the Personal Genome Project. Main outcome measures Genotyping (that is, identification of the correct DNA base at a specific genomic location) using SNP chips versus sequencing, with results split by frequency of that genotype in the population. Rare pathogenic variants in the BRCA1 and BRCA2 genes were selected as an exemplar for detailed analysis of clinically actionable variants in the UK Biobank, and BRCA related cancers (breast, ovarian, prostate, and pancreatic) were assessed in participants through use of cancer registry data. Results Overall, genotyping using SNP chips performed well compared with sequencing; sensitivity, specificity, positive predictive value, and negative predictive value were all above 99% for 108 574 common variants directly genotyped on the SNP chips and sequenced in the UK Biobank. However, the likelihood of a true positive result decreased dramatically with decreasing variant frequency; for variants that are very rare in the population, with a frequency below 0.001% in UK Biobank, the positive predictive value was very low and only 16% of 4757 heterozygous genotypes from the SNP chips were confirmed with sequencing data. Results were similar for SNP chip data from the Personal Genome Project, and 20/21 individuals analysed had at least one false positive rare pathogenic variant that had been incorrectly genotyped. 
For pathogenic variants in the BRCA1 and BRCA2 genes, which are individually very rare, the overall performance metrics for the SNP chips versus sequencing in the UK Biobank were: sensitivity 34.6%, specificity 98.3%, positive predictive value 4.2%, and negative predictive value 99.9%. Rates of BRCA related cancers in UK Biobank participants with a positive SNP chip result were similar to those for age matched controls (odds ratio 1.31, 95% confidence interval 0.99 to 1.71) because the vast majority of variants were false positives, whereas sequence positive participants had a significantly increased risk (odds ratio 4.05, 2.72 to 6.03). Conclusions SNP chips are extremely unreliable for genotyping very rare pathogenic variants and should not be used to guide health decisions without validation.
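The fall in positive predictive value as variants become rarer follows from Bayes' rule. The sketch below is an illustration only: it holds sensitivity and specificity fixed at the BRCA1/BRCA2 figures quoted above and varies a hypothetical carrier prevalence. The function name `ppv` and the prevalence values are assumptions made for this example and are not taken from the study.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence                   # P(test+, carrier)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(test+, non-carrier)
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as reported for BRCA1/BRCA2 above;
# the prevalence values are hypothetical.
for prevalence in (0.1, 0.01, 0.001, 0.0001):
    print(f"prevalence={prevalence:.4f}  PPV={ppv(0.346, 0.983, prevalence):.3f}")
```

With specificity held constant, the false positive term dominates the denominator once carriers become rare, so even a test with 98.3% specificity yields mostly false positives.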


Assessing the analytical validity of SNP-chips for detecting very rare pathogenic variants: implications for direct-to-consumer genetic testing

10.1101/696799, 2019, Cited By ~ 14

Author(s): Michael N Weedon, Leigh Jackson, James W Harrison, Kate S Ruth, Jessica Tyrrell, ...

Keyword(s): Genetic Testing, Genetic Variants, Rare Variants, UK Biobank, Sequencing Data, SNP Chip, Direct To Consumer, Pathogenic Variants, Variant Frequency, The UK

ABSTRACT Objectives: To determine the analytical validity of SNP-chips for genotyping very rare genetic variants. Design: Retrospective study using data from two publicly available resources, the UK Biobank and the Personal Genome Project. Setting: Research biobanks and direct-to-consumer genetic testing in the UK and USA. Participants: 49,908 individuals recruited to UK Biobank, and 21 individuals who purchased consumer genetic tests and shared their data online via the Personal Genome Project. Main outcome measures: We assessed the analytical validity of genotypes from SNP-chips (index test) against sequencing data (reference standard). We evaluated the genotyping accuracy of the SNP-chips and split the results by variant frequency. We went on to select rare pathogenic variants in the BRCA1 and BRCA2 genes as an exemplar for detailed analysis of clinically-actionable variants in UK Biobank, and assessed BRCA-related cancers (breast, ovarian, prostate and pancreatic) in participants using cancer registry data. Results: SNP-chip genotype accuracy is high overall; sensitivity, specificity and precision are all >99% for 108,574 common variants directly genotyped by the UK Biobank SNP-chips. However, the likelihood of a true positive result reduces dramatically with decreasing variant frequency; for variants with a frequency <0.001% in UK Biobank the precision is very low and only 16% of 4,711 variants from the SNP-chips are confirmed by sequencing data. Results are similar for SNP-chip data from the Personal Genome Project, and 20/21 individuals have at least one rare pathogenic variant that has been incorrectly genotyped. For pathogenic variants in the BRCA1 and BRCA2 genes, the overall performance metrics of the SNP-chips in UK Biobank are sensitivity 34.6%, specificity 98.3% and precision 4.2%. Rates of BRCA-related cancers in individuals in UK Biobank with a positive SNP-chip result are similar to age-matched controls (OR 1.28, P=0.07, 95% CI: 0.98 to 1.67), while sequence-positive individuals have a significantly increased risk (OR 3.73, P=3.5×10⁻¹², 95% CI: 2.57 to 5.40). Conclusion: SNP-chips are extremely unreliable for genotyping very rare pathogenic variants and should not be used to guide health decisions without validation.

SUMMARY BOX

Section 1: What is already known on this topic. SNP-chips are an accurate and affordable method for genotyping common genetic variants across the genome. They are often used by direct-to-consumer (DTC) genetic testing companies and research studies, but there are several case reports suggesting they perform poorly for genotyping rare genetic variants when compared with sequencing.

Section 2: What this study adds. Our study confirms that SNP-chips are highly inaccurate for genotyping rare, clinically-actionable variants. Using large-scale SNP-chip and sequencing data from UK Biobank, we show that SNP-chips have a very low precision of <16% for detecting very rare variants (i.e. the majority of variants with population frequency of <0.001% are false positives). We observed a similar performance in a small sample of raw SNP-chip data from DTC genetic tests. Very rare variants assayed using SNP-chips should not be used to guide health decisions without validation.


Novel genotyping algorithms for rare variants significantly improve the accuracy of Applied Biosystems™ Axiom™ array genotyping calls

10.1101/2021.09.13.459984, 2021

Author(s): Orna Mizrahi Man, Marcos H Woehrmann, Teresa A Webster, Jeremy Gollub, Adrian Bivol, ...

Keyword(s): Positive Predictive Value, Exome Sequencing, Predictive Value, Rare Variants, UK Biobank, Sequencing Data, Data Set, Array Data, Exome Sequencing Data, The UK

Objective: To significantly improve the positive predictive value (PPV) and sensitivity of Applied Biosystems™ Axiom™ array variant calling, by means of novel improvements to genotyping algorithms and careful quality control of array probesets. The improvement makes array genotyping more suitable for very rare variants. Design: Retrospective evaluation of UK Biobank array data re-genotyped with improved algorithms for rare variants. Participants: 488,359 people recruited to the UK Biobank with Axiom array genotyping data, including 200,630 with exome sequencing data. Main Outcome Measures: A comparison of genotyping calls from array data to genotyping calls on a subset of variants with exome sequencing data. Results: Axiom genotyping [18] performed well, based on comparison to sequencing data, for over 100,000 common variants directly genotyped on the Axiom UK Biobank array and also exome sequenced by the UK Biobank Exome Sequencing Consortium. However, in a comparison to the initial exome sequencing results of the first 50K individuals, Weedon et al. [1] observed that when grouping these variants by the minor allele frequency (MAF) observed in UK Biobank, the concordance with sequencing and resulting positive predictive value (PPV) decreased with the number of heterozygous (Het) array calls per variant. An improved genotyping algorithm, Rare Heterozygous Adjustment (RHA) [16], released mid-2020 for genotyping on Axiom arrays, significantly improves PPV in all MAF ranges for the 50K data as well as when compared to the exome sequencing of 200K individuals, released after Weedon et al. [1] performed their comparison. The RHA algorithm improved PPVs in the 200K data in the lowest three frequency groups, [0, 0.001%), [0.001%, 0.005%) and [0.005%, 0.01%), to 83%, 82% and 88%, respectively. PPV was above 95% for higher MAF ranges without algorithm improvement.
PPVs are somewhat higher in the 200K dataset, due to a different "truth set" from exome sequencing and because monomorphic exome loci are not included in the joint genotyping calls for the 200K data set, as explained in the methods section. Sensitivity was higher in the 200K data set than in the original 50K data as well, especially for low MAF ranges. This increase is in part due to the larger data set over which sensitivity could be computed and in part due to the different WES algorithms used for the 200K data [7]. Filtering of a relatively small number of non-performing probesets (determined without reference to the exome sequencing data) significantly improved sensitivities for all MAF ranges, resulting in 70%, 88% and 94%, respectively, in the three lowest MAF ranges and greater than 98% and 99.9% for the two higher MAF ranges ([0.01%, 1%), [1%, 50%]). Conclusions: Improved algorithms for genotyping, along with enhanced quality control of array probesets, significantly improve the positive predictive value and the sensitivity of array data, making it suitable for the detection of very rare variants. The probeset filtering methods developed have resulted in better probe designs for arrays, and the new genotyping algorithm has been part of the standard algorithm for all Axiom arrays since early 2020.
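The per-frequency comparison described above can be sketched as a small tally of array calls against sequencing calls, grouped into the MAF ranges quoted in the abstract. This is a minimal illustration, not the authors' pipeline: the record layout `(maf_percent, array_het, seq_het)`, the function names, and the toy records are all hypothetical.

```python
from collections import defaultdict

# MAF bin edges in percent, matching the ranges quoted above:
# [0, 0.001), [0.001, 0.005), [0.005, 0.01), [0.01, 1), [1, 50]
BIN_EDGES = [0.0, 0.001, 0.005, 0.01, 1.0, 50.0]


def maf_bin(maf_percent):
    """Index of the half-open bin containing maf_percent (last bin is closed)."""
    last = len(BIN_EDGES) - 2
    for i in range(last + 1):
        lo, hi = BIN_EDGES[i], BIN_EDGES[i + 1]
        if lo <= maf_percent < hi or (i == last and maf_percent == hi):
            return i
    raise ValueError(f"MAF {maf_percent}% is outside [0%, 50%]")


def per_bin_metrics(calls):
    """PPV and sensitivity of heterozygous array calls per MAF bin.

    calls: iterable of (maf_percent, array_het, seq_het), one record per
    genotype, with sequencing treated as the reference standard.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for maf, array_het, seq_het in calls:
        c = counts[maf_bin(maf)]
        if array_het and seq_het:
            c["tp"] += 1
        elif array_het:
            c["fp"] += 1
        elif seq_het:
            c["fn"] += 1
    metrics = {}
    for b, c in sorted(counts.items()):
        called = c["tp"] + c["fp"]   # all heterozygous array calls
        truth = c["tp"] + c["fn"]    # all heterozygous sequencing calls
        metrics[b] = {
            "ppv": c["tp"] / called if called else None,
            "sensitivity": c["tp"] / truth if truth else None,
        }
    return metrics


# Hypothetical toy records, for illustration only.
toy_calls = [
    (0.0005, True, True),
    (0.0005, True, False),
    (0.0005, False, True),
    (2.0, True, True),
]
print(per_bin_metrics(toy_calls))
```

Bins with no array calls or no sequencing calls report `None` rather than a ratio, which avoids dividing by zero when a frequency range is sparsely populated.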


Surveying the contribution of rare variants to the genetic architecture of human disease through exome sequencing of 177,882 UK Biobank participants

10.1101/2020.12.13.422582, 2020

Author(s): Quanli Wang, Ryan S. Dhindsa, Keren Carss, Andrew R Harper, Abhishek Nag, ...

Keyword(s): Exome Sequencing, Drug Targets, Rare Variants, Population Based, UK Biobank, Loss Of Function, Sequencing Data, Phenotypic Data, Protein Coding, The UK

The UK Biobank (UKB) represents an unprecedented population-based study of 502,543 participants with detailed phenotypic data and linkage to medical records. While the release of genotyping array data for this cohort has bolstered genomic discovery for common variants, the contribution of rare variants to this broad phenotype collection remains relatively unknown. Here, we use exome sequencing data from 177,882 UKB participants to evaluate the association between rare protein-coding variants and 10,533 binary and 1,419 quantitative phenotypes. We performed both a variant-level phenome-wide association study (PheWAS) and a gene-level collapsing analysis-based PheWAS tailored to detecting the aggregate contribution of rare variants. The latter revealed 911 statistically significant gene-phenotype relationships, with a median odds ratio of 15.7 for binary traits. Among the binary trait associations identified using collapsing analysis, 83% were undetectable using single variant association tests, emphasizing the power of collapsing analysis to detect signal in the setting of high allelic heterogeneity. As a whole, these genotype-phenotype associations were significantly enriched for loss-of-function mediated traits and currently approved drug targets. Using these results, we summarise the contribution of rare variants to common diseases in the context of the UKB phenome and provide an example of how novel gene-phenotype associations can aid in therapeutic target prioritisation.
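The idea behind a gene-level collapsing analysis can be sketched as follows: each participant is reduced to a carrier or non-carrier flag per gene, and carrier counts are then compared between cases and controls. The abstract does not state which test statistic was used, so the one-sided Fisher's exact test below is an assumption, as are the function names and the toy sample IDs.

```python
from math import comb


def collapse_to_carriers(qualifying_variants):
    """Map gene -> set of sample IDs carrying at least one qualifying variant.

    qualifying_variants: iterable of (sample_id, gene) pairs; several
    variants in the same gene collapse to a single carrier flag.
    """
    carriers = {}
    for sample_id, gene in qualifying_variants:
        carriers.setdefault(gene, set()).add(sample_id)
    return carriers


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the 2x2 table
    [[carrier cases a, carrier controls b],
     [non-carrier cases c, non-carrier controls d]]."""
    n_carriers, n_cases, total = a + b, a + c, a + b + c + d
    upper = min(n_carriers, n_cases)
    tail = sum(
        comb(n_carriers, k) * comb(total - n_carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_cases)


def gene_level_test(gene_carriers, cases, controls):
    """Return the 2x2 table and one-sided p-value for one gene."""
    a = len(gene_carriers & cases)
    b = len(gene_carriers & controls)
    c = len(cases) - a
    d = len(controls) - b
    return (a, b, c, d), fisher_one_sided(a, b, c, d)


# Hypothetical toy data, for illustration only.
variants = [("s1", "G"), ("s1", "G"), ("s2", "G"), ("s3", "G"), ("s4", "H")]
cases, controls = {"s1", "s2", "s3"}, {"s4", "s5", "s6"}
carriers = collapse_to_carriers(variants)
print(gene_level_test(carriers["G"], cases, controls))
```

Because every qualifying variant in a gene contributes to the same carrier count, signal spread across many individually rare alleles is pooled, which is why such a test can detect associations that single-variant tests miss.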


Biobanks and Legislation in Switzerland – a data protection perspective

Journal of International Biotechnology Law, 10.1515/jibl.2007.028, 2007, Vol 4 (5)

Author(s): Claudia Mund

Keyword(s): Medical Records, Population Based, Genome Project, Genetic Interactions, Government Support, UK Biobank, Large Section, Homogeneous Population, Set Up, The UK

Abstract The combination of health and lifestyle data with bodily substances and genetic information has led over the last few years to the creation of so-called biobanks. These biobanks are used to research a large number of diseases of modern civilization and their genetic interactions. One of the best-known projects in this respect is without a doubt the Icelandic biobank, operated since 1998 by a private pharmaceutical company, deCode Genetics, and set up with government support. The database contains family histories and medical records, as well as biological samples taken from a large section of Iceland's homogeneous population. The hope is that this data will allow a correlation to be established between genetic predispositions and the onset of widespread diseases. Other examples of population-based biobanks include the Estonian Genome Project, launched in 2001 by the Estonian Government, as well as the UK Biobank in the UK and PopGen in the state of Schleswig-Holstein in Germany.


Associations Between Insomnia Symptoms and Mortality in the UK Biobank Cohort: A Prospective Population-Based Study

SSRN Electronic Journal, 10.2139/ssrn.3417879, 2019

Author(s): Malcolm von Schantz, Jason C. Ong, Kristen Knutson

Keyword(s): Population Based, UK Biobank, Population Based Study, The UK, Insomnia Symptoms


Abstract 13296: Risk Loci of Hypertrophic Cardiomyopathy Identified via the UK Biobank

Circulation, 10.1161/circ.142.suppl_3.13296, 2020, Vol 142 (Suppl_3)

Author(s): Alex Gyftopoulos, Yi-Ju Chen, Libin Wang, Charles H Williams, Young Wook Chun, ...

Keyword(s): Hypertrophic Cardiomyopathy, Large Population, Phenotypic Expression, Coiled Coil, Intraflagellar Transport, Genetic Profile, UK Biobank, Disease Manifestation, Pathogenic Variants, The UK

Introduction: Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease, affecting 1:500 to 1:200 individuals worldwide. HCM has a heterogeneous genetic profile and phenotypic expression. More than 1400 known pathogenic variants have been identified in 11 sarcomere genes. In about 40% of HCM patients, the genetic cause may not be identified. The same mutation may lead to different phenotypes and severity in different individuals. Identification of novel HCM genes and modifiers will expand our understanding of the signaling pathways that are responsible for phenotypic expression of HCM. Methods: The UK Biobank comprises clinical and genetic data for greater than 500,000 individuals. We used OASIS, an information system for analyzing, searching, and visualizing associations between phenotype and genotype data, to analyze this data. We compared control individuals to HCM individuals identified by ICD-10 code (I42.1 and I42.2) in a 20-to-1 fashion. Related individuals and those with confounding diagnoses were excluded. Results: The analysis was performed with PLINK's GLM option, and we identified 84 variants with a minor allele frequency of 0.5% or greater in 65 genes associated with HCM with p < 1×10⁻⁶, including 4 with p < 5×10⁻⁸. The identified genes encode lncRNAs, miRNAs, and membrane proteins. Variants with high significance were identified in the genes encoding putative ciliary components DNAL4 (dynein axonemal light chain 4; p = 2.9×10⁻⁸), MYO1D (unconventional myosin 1D; p = 3.1×10⁻⁸), ITFAP (intraflagellar transport associated protein; p = 9.5×10⁻⁸), CABCOCO1 (ciliary associated calcium binding coiled-coil 1; p = 3.7×10⁻⁷), EVL (Enah-Vasp-like; p = 4.4×10⁻⁷) and IFT122 (intraflagellar transport 122; p = 8.0×10⁻⁷). Conclusion: While none of these have previously been associated with HCM, our findings suggest ciliary structure and function may play a role in disease manifestation.
Our method is unique in pooling individuals in a large population set to identify potential causative or contributing mutations. Bioinformatic tools, such as OASIS, allow for the identification of previously unrecognized variants that may play a role in the development of HCM. This approach has identified numerous novel genes as possible risk loci.


Comparison of prognostic models to predict the occurrence of colorectal cancer in asymptomatic individuals: a systematic literature review and external validation in the EPIC and UK Biobank prospective cohort studies

Gut, 10.1136/gutjnl-2017-315730, 2018, Vol 68 (4), pp. 672-683, Cited By ~ 6

Author(s): Todd Smith, David C Muller, Karel G M Moons, Amanda J Cross, Mattias Johansson, ...

Keyword(s): Colorectal Cancer, Systematic Review, Prediction Models, Large Population, External Validation, Population Based, Prognostic Models, UK Biobank, Asymptomatic Individuals, The UK

Objective: To systematically identify and validate published colorectal cancer risk prediction models that do not require invasive testing in two large population-based prospective cohorts. Design: Models were identified through an update of a published systematic review and validated in the European Prospective Investigation into Cancer and Nutrition (EPIC) and the UK Biobank. The performance of the models to predict the occurrence of colorectal cancer within 5 or 10 years after study enrolment was assessed by discrimination (C-statistic) and calibration (plots of observed vs predicted probability). Results: The systematic review and its update identified 16 models from 8 publications (8 colorectal, 5 colon and 3 rectal). The number of participants included in each model validation ranged from 41 587 to 396 515, and the number of cases ranged from 115 to 1781. Eligible and ineligible participants across the models were largely comparable. Calibration of the models, where assessable, was very good and further improved by recalibration. The C-statistics of the models were largely similar between validation cohorts with the highest values achieved being 0.70 (95% CI 0.68 to 0.72) in the UK Biobank and 0.71 (95% CI 0.67 to 0.74) in EPIC. Conclusion: Several of these non-invasive models exhibited good calibration and discrimination within both external validation populations and are therefore potentially suitable candidates for the facilitation of risk stratification in population-based colorectal screening programmes. Future work should both evaluate this potential, through modelling and impact studies, and ascertain if further enhancement in their performance can be obtained.
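As an illustration of the discrimination measure named above, the C-statistic for a binary outcome can be computed as the proportion of case and non-case pairs in which the case received the higher predicted risk, with ties counted as half. This sketch assumes a simple binary outcome at a fixed horizon; the study may have used a time-to-event variant, and the function name and toy inputs are hypothetical.

```python
def c_statistic(predicted, observed):
    """Concordance (C-statistic) for binary outcomes.

    predicted: predicted risks, one per participant.
    observed:  1 if the participant developed the outcome, else 0.
    """
    cases = [p for p, y in zip(predicted, observed) if y == 1]
    non_cases = [p for p, y in zip(predicted, observed) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for p_case in cases:
        for p_non in non_cases:
            if p_case > p_non:
                concordant += 1.0   # case ranked above non-case
            elif p_case == p_non:
                concordant += 0.5   # tie counts as half
    return concordant / (len(cases) * len(non_cases))


# Hypothetical toy inputs, for illustration only.
print(c_statistic([0.30, 0.10, 0.20, 0.05], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation, which puts the reported values of about 0.70 in context. The pairwise loop is quadratic in cohort size, so a rank-based formulation would be preferable for cohorts as large as those described.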


Investigating the impact of disease and health record duration on the eMERGE algorithm for rheumatoid arthritis

Journal of the American Medical Informatics Association, 10.1093/jamia/ocaa014, 2020, Vol 27 (4), pp. 601-605

Author(s): Vanessa L Kronzer, Liwei Wang, Hongfang Liu, John M Davis, Jeffrey A Sparks, ...

Keyword(s): Rheumatoid Arthritis, Positive Predictive Value, Negative Predictive Value, Predictive Value, Index Date, Model Performance, Area Under The Curve, Population Based, Health Record, The Impact

Abstract Objective The study sought to determine the dependence of the Electronic Medical Records and Genomics (eMERGE) rheumatoid arthritis (RA) algorithm on both RA and electronic health record (EHR) duration. Materials and Methods Using a population-based cohort from the Mayo Clinic Biobank, we identified 497 patients with at least 1 RA diagnosis code. RA case status was manually determined using validated criteria for RA. RA duration was defined as time from first RA code to the index date of biobank enrollment. To simulate EHR duration, various years of EHR lookback were applied, starting at the index date and going backward. Model performance was determined by sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve (AUC). Results The eMERGE algorithm performed well in this cohort, with overall sensitivity 53%, specificity 99%, positive predictive value 97%, negative predictive value 74%, and AUC 76%. Among patients with RA duration <2 years, sensitivity and AUC were only 9% and 54%, respectively, but increased to 71% and 85% among patients with RA duration >10 years. Longer EHR lookback also improved model performance up to a threshold of 10 years, in which sensitivity reached 52% and AUC 75%. However, optimal EHR lookback varied by RA duration; an EHR lookback of 3 years was best able to identify recently diagnosed RA cases. Conclusions eMERGE algorithm performance improves with longer RA duration as well as EHR duration up to 10 years, though shorter EHR lookback can improve identification of recently diagnosed RA cases.


Altered Cortical Brain Structure and Increased Risk for Disease Seen Decades After Perinatal Exposure to Maternal Smoking: A Study of 9000 Adults in the UK Biobank

Cerebral Cortex, 10.1093/cercor/bhz060, 2019, Vol 29 (12), pp. 5217-5233, Cited By ~ 4

Author(s): Lauren E Salminen, Rand R Wilcox, Alyssa H Zhu, Brandalyn C Riedel, Christopher R K Ching, ...

Keyword(s): Brain Structure, Brain MRI, Population Based, Smoke Exposure, Sensitivity Analyses, UK Biobank, Increased Risk, Increase Risk, Sensory Cortices, The UK

Abstract Secondhand smoke exposure is a major public health risk that is especially harmful to the developing brain, but it is unclear if early exposure affects brain structure during middle age and older adulthood. Here we analyzed brain MRI data from the UK Biobank in a population-based sample of individuals (ages 44–80) who were exposed (n = 2510) or unexposed (n = 6079) to smoking around birth. We used robust statistical models, including quantile regressions, to test the effect of perinatal smoke exposure (PSE) on cortical surface area (SA), thickness, and subcortical volumes. We hypothesized that PSE would be associated with cortical disruption in primary sensory areas compared to unexposed (PSE−) adults. After adjusting for multiple comparisons, SA was significantly lower in the pericalcarine (PCAL), inferior parietal (IPL), and regions of the temporal and frontal cortex of PSE+ adults; these abnormalities were associated with increased risk for several diseases, including circulatory and endocrine conditions. Sensitivity analyses conducted in a hold-out group of healthy participants (exposed, n = 109, unexposed, n = 315) replicated the effect of PSE on SA in the PCAL and IPL. Collectively our results show a negative, long term effect of PSE on sensory cortices that may increase risk for disease later in life.
