How Array Design affects SNP Ascertainment Bias

Mapping Intimacies ◽

10.1101/833541 ◽

2019 ◽

Cited By ~ 1

Author(s):

Johannes Geibel ◽

Christian Reimer ◽

Steffen Weigend ◽

Annett Weigend ◽

Torsten Pook ◽

...

Keyword(s):

Population Genetic ◽

Large Scale ◽

Development Process ◽

Rare Variants ◽

Ascertainment Bias ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Frequency Spectra ◽

Original Array

Abstract: Single nucleotide polymorphisms (SNPs), genotyped with SNP arrays, have become a widely used marker type in population genetic analyses over the last 10 years. However, compared to whole genome re-sequencing data, arrays are known to lack a substantial proportion of globally rare variants and tend to be biased towards variants present in populations involved in the development process of the respective array. This affects population genetic estimators and is known as SNP ascertainment bias. We investigated factors contributing to ascertainment bias in array development by redesigning the Axiom™ Genome-Wide Chicken Array in silico and evaluating changes in allele frequency spectra and heterozygosity estimates in a stepwise manner. A sequential reduction of rare alleles during the development process was shown, with the main influencing factors being the identification of SNPs in a limited set of populations and a within-population selection of common SNPs when aiming for equidistant spacing. These effects were shown to be less severe with a larger discovery panel. Additionally, a generally massive overestimation of expected heterozygosity for the ascertained SNP sets was shown. In the case of the original array, this overestimation was 24% higher for populations involved in the discovery process than for populations not involved. The same was observed after the SNP discovery step in the redesign. However, an unequal contribution of populations during the SNP selection can mask this effect but also adds uncertainty. Finally, we make suggestions for the design of specialized arrays for large-scale projects where whole genome re-sequencing techniques are still too expensive.

How array design creates SNP ascertainment bias

PLoS ONE ◽

10.1371/journal.pone.0245178 ◽

2021 ◽

Vol 16 (3) ◽

pp. e0245178

Author(s):

Johannes Geibel ◽

Christian Reimer ◽

Steffen Weigend ◽

Annett Weigend ◽

Torsten Pook ◽

...

Keyword(s):

Population Genetic ◽

Large Scale ◽

Development Process ◽

Rare Variants ◽

Ascertainment Bias ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Frequency Spectra ◽

Original Array

Single nucleotide polymorphisms (SNPs), genotyped with arrays, have become a widely used marker type in population genetic analyses over the last 10 years. However, compared to whole genome re-sequencing data, arrays are known to lack a substantial proportion of globally rare variants and tend to be biased towards variants present in populations involved in the development process of the respective array. This affects population genetic estimators and is known as SNP ascertainment bias. We investigated factors contributing to ascertainment bias in array development by redesigning the Axiom™ Genome-Wide Chicken Array in silico and evaluating changes in allele frequency spectra and heterozygosity estimates in a stepwise manner. A sequential reduction of rare alleles during the development process was shown. This was mainly caused by the identification of SNPs in a limited set of populations and a within-population selection of common SNPs when aiming for equidistant spacing. These effects were shown to be less severe with a larger discovery panel. Additionally, a generally massive overestimation of expected heterozygosity for the ascertained SNP sets was shown. In the case of the original array, this overestimation was 24% higher for populations involved in the discovery process than for populations not involved. The same was observed after the SNP discovery step in the redesign. However, an unequal contribution of populations during the SNP selection can mask this effect but also adds uncertainty. Finally, we make suggestions for the design of specialized arrays for large-scale projects where whole genome re-sequencing techniques are still too expensive.
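
The effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: it assumes a handful of made-up allele frequencies, uses the standard per-locus expected heterozygosity 2p(1 - p), and mimics ascertainment with a simple minor-allele-frequency cutoff (the names `ascertain`, `full_set` and `array_set` are hypothetical). It only shows why dropping rare variants tends to raise the average expected heterozygosity of the retained SNP set.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def mean_he(freqs):
    """Average expected heterozygosity over a list of allele frequencies."""
    return sum(expected_heterozygosity(p) for p in freqs) / len(freqs)


def ascertain(freqs, min_maf):
    """Keep only SNPs whose minor allele frequency is at least min_maf.

    A crude stand-in for array design steps that favour common SNPs.
    """
    return [p for p in freqs if min(p, 1.0 - p) >= min_maf]


# Hypothetical allele frequencies, as might be seen in sequencing data.
full_set = [0.01, 0.02, 0.05, 0.10, 0.30, 0.50]

# Hypothetical "array" SNP set: rare variants are filtered out.
array_set = ascertain(full_set, min_maf=0.10)

if __name__ == "__main__":
    print("mean He, all SNPs:        ", mean_he(full_set))
    print("mean He, ascertained SNPs:", mean_he(array_set))
```

In this toy example the ascertained set has a higher mean expected heterozygosity than the full set, purely because the rare variants were removed before averaging.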

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Risk prediction and marker selection in nonsynonymous single nucleotide polymorphisms using whole genome sequencing data

Animal Cells and Systems ◽

10.1080/19768354.2020.1860125 ◽

2020 ◽

Vol 24 (6) ◽

pp. 321-328

Author(s):

Young-Sup Lee ◽

KyeongHye Won ◽

Donghyun Shin ◽

Jae-Don Oh

Keyword(s):

Single Nucleotide Polymorphisms ◽

Whole Genome Sequencing ◽

Risk Prediction ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Marker Selection

Population Genomics of American Mink Using Whole Genome Sequencing Data

Genes ◽

10.3390/genes12020258 ◽

2021 ◽

Vol 12 (2) ◽

pp. 258

Author(s):

Karim Karimi ◽

Duy Ngoc Do ◽

Mehdi Sargolzaei ◽

Younes Miar

Keyword(s):

Population Genomics ◽

Association Studies ◽

American Mink ◽

Population History ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Effective Population ◽

Cross Validation Error

Characterizing the genetic structure and population history can facilitate the development of genomic breeding strategies for the American mink. In this study, we used the whole genome sequences of 100 mink from the Canadian Centre for Fur Animal Research (CCFAR) at the Dalhousie Faculty of Agriculture (Truro, NS, Canada) and Millbank Fur Farm (Rockwood, ON, Canada) to investigate their population structure, genetic diversity and linkage disequilibrium (LD) patterns. Analysis of molecular variance (AMOVA) indicated that the variation among color-types was significant (p < 0.001) and accounted for 18% of the total variation. The admixture analysis revealed that assuming three ancestral populations (K = 3) provided the lowest cross-validation error (0.49). The effective population size (Ne) five generations ago was estimated to be 99 and 50 for CCFAR and Millbank Fur Farm, respectively. The LD patterns revealed that the average r² dropped below 0.2 at genomic distances of >20 kb in CCFAR and >100 kb in Millbank Fur Farm, suggesting that densities of 120,000 and 24,000 single nucleotide polymorphisms (SNPs), respectively, would provide adequate accuracy of genomic evaluation in these populations. These results indicated that accounting for admixture is critical for designing SNP panels for genotype-phenotype association studies of American mink.
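
The abstract does not say how the two SNP densities were derived. One reading that reproduces both figures is to place roughly one marker per LD-decay distance along the genome. The sketch below works through that arithmetic under two loudly stated assumptions: the one-marker-per-interval rule itself, and a genome length of about 2.4 Gb, which is simply the value implied by the reported numbers rather than one given in the text. The function and constant names are hypothetical.

```python
def markers_needed(genome_length_bp, spacing_bp):
    """Number of evenly spaced markers if one marker is placed every spacing_bp."""
    return genome_length_bp // spacing_bp


# Assumption: genome length of ~2.4 Gb (not stated in the abstract;
# chosen because it is the value the reported densities imply).
ASSUMED_GENOME_LENGTH_BP = 2_400_000_000

# LD-decay distances reported in the abstract (average r² < 0.2 beyond these).
CCFAR_SPACING_BP = 20_000
MILLBANK_SPACING_BP = 100_000

if __name__ == "__main__":
    print("CCFAR:   ", markers_needed(ASSUMED_GENOME_LENGTH_BP, CCFAR_SPACING_BP))
    print("Millbank:", markers_needed(ASSUMED_GENOME_LENGTH_BP, MILLBANK_SPACING_BP))
```

Under these assumptions a five-fold longer LD-decay distance translates directly into a five-fold smaller marker panel.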

Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btq343 ◽

2010 ◽

Vol 26 (17) ◽

pp. 2101-2108 ◽

Cited By ~ 27

Author(s):

Jiří Macas ◽

Pavel Neumann ◽

Petr Novák ◽

Jiming Jiang

Keyword(s):

Large Scale ◽

Rice Genome ◽

Supplementary Information ◽

Sequencing Data ◽

Satellite Repeat ◽

Frequency Spectra ◽

Consensus Sequences ◽

Chip Sequencing ◽

Conserved Sequence ◽

Centromeric Satellite

Abstract. Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Supplementary information: Supplementary data are available at Bioinformatics online.
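
The alignment-free idea in this abstract rests on counting k-mers directly in the reads. The following is a minimal, generic sketch of that counting step only; it is not the authors' software, and the function names and the example read are made up for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of sequence reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k along the read, one base at a time.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_frequencies(counts):
    """Convert raw k-mer counts into relative frequencies (a simple spectrum)."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy read containing a repeated 4-base motif.
    reads = ["ACGTACGT"]
    counts = kmer_counts(reads, k=4)
    for kmer, freq in sorted(kmer_frequencies(counts).items()):
        print(kmer, freq)
```

Because no alignment is involved, the same counting can be applied to two read sets independently and the resulting frequency tables compared afterwards.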

Leveraging linkage evidence to identify low-frequency and rare variants on 16p13 associated with blood pressure using TOPMed whole genome sequencing data

Human Genetics ◽

10.1007/s00439-019-01975-0 ◽

2019 ◽

Vol 138 (2) ◽

pp. 199-210 ◽

Cited By ~ 7

Author(s):

Karen Y. He ◽

Xiaoyin Li ◽

Tanika N. Kelly ◽

Jingjing Liang ◽

...

Keyword(s):

Blood Pressure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variants ◽

Low Frequency ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Linkage Evidence

Large-Scale Whole-Genome Sequencing Reveals the Genetic Architecture of Primary Membranoproliferative GN and C3 Glomerulopathy

Journal of the American Society of Nephrology ◽

10.1681/asn.2019040433 ◽

2020 ◽

Vol 31 (2) ◽

pp. 365-373 ◽

Cited By ~ 7

Author(s):

Adam P. Levine ◽

Melanie M.Y. Chan ◽

Omid Sadeghi-Alavijeh ◽

Edwin K.S. Wong ◽

H. Terence Cook ◽

...

Keyword(s):

Large Scale ◽

Rare Variants ◽

Alternative Pathway ◽

Atypical Hemolytic Uremic Syndrome ◽

Gene Mutations ◽

Whole Genome Sequence ◽

European Ancestry ◽

Whole Genome ◽

C3 Glomerulopathy ◽

Complement Gene

Background: Primary membranoproliferative GN, including complement 3 (C3) glomerulopathy, is a rare, untreatable kidney disease characterized by glomerular complement deposition. Complement gene mutations can cause familial C3 glomerulopathy, and studies have reported rare variants in complement genes in nonfamilial primary membranoproliferative GN. Methods: We analyzed whole-genome sequence data from 165 primary membranoproliferative GN cases and 10,250 individuals without the condition (controls) as part of the National Institutes of Health Research BioResource–Rare Diseases Study. We examined copy number, rare, and common variants. Results: Our analysis included 146 primary membranoproliferative GN cases and 6442 controls who were unrelated and of European ancestry. We observed no significant enrichment of rare variants in candidate genes (genes encoding components of the complement alternative pathway and other genes associated with the related disease atypical hemolytic uremic syndrome; 6.8% in cases versus 5.9% in controls) or exome-wide. However, a significant common variant locus was identified at 6p21.32 (rs35406322) (P=3.29×10⁻⁸; odds ratio [OR], 1.93; 95% confidence interval [95% CI], 1.53 to 2.44), overlapping the HLA locus. Imputation of HLA types mapped this signal to a haplotype incorporating DQA1*05:01, DQB1*02:01, and DRB1*03:01 (P=1.21×10⁻⁸; OR, 2.19; 95% CI, 1.66 to 2.89). This finding was replicated by analysis of HLA serotypes in 338 individuals with membranoproliferative GN and 15,614 individuals with nonimmune renal failure. Conclusions: We found that HLA type, but not rare complement gene variation, is associated with primary membranoproliferative GN. These findings challenge the paradigm of complement gene mutations typically causing primary membranoproliferative GN and implicate an underlying autoimmune mechanism in most cases.

High definition analyses of single cohort, whole genome sequencing data provides a direct route to defining sub-phenotypes and personalising medicine

10.1101/2021.08.28.21262560 ◽

2021 ◽

Author(s):

KE Joyce ◽

E Onabanjo ◽

S Brownlow ◽

F Nur ◽

KO Olupona ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variants ◽

Phenotypic Variability ◽

Clinical Impact ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

High Definition ◽

Genomic Damage

Abstract: Possession of a clinical or molecular disease label alters the context in which life-course events operate, but rarely explains the phenotypic variability observed by clinicians. Whole genome sequencing of unselected endothelial vasculopathy patients demonstrated more than a third had rare, likely deleterious variants in clinically-relevant genes unrelated to their vasculopathy (1 in 10 within platelet genes; 1 in 8 within coagulation genes; and 1 in 4 within erythrocyte hemolytic genes). High erythrocyte membrane variant rates paralleled genomic damage and prevalence indices in the general population. In blinded analyses, patients with greater hemorrhagic severity that had been attributed solely to their vasculopathy had more deleterious variants in platelet (Spearman ρ=0.25, p=0.008) and coagulation (Spearman ρ=0.21, p=0.024) genes. We conclude that rare diseases can provide insights for medicine beyond their primary pathophysiology, and propose a framework based on rare variants to inform interpretative approaches to accelerate clinical impact from whole genome sequencing.

Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in published whole-genome sequence data from more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) was used to study Plasmodium genetic diversity, population structures and genomic signatures of selection, and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax has been made available in an interactive web-based R Shiny application, PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).

P4-097: RARE VARIANTS IN FAMILIAL LATE-ONSET ALZHEIMER'S DISEASE IDENTIFIED FROM LARGE SCALE WHOLE GENOME SEQUENCING

Alzheimer's & Dementia ◽

10.1016/j.jalz.2019.06.3757 ◽

2019 ◽

Vol 15 ◽

pp. P1312-P1312

Author(s):

Badri N. Vardarajan ◽

James Jaworski ◽

Gary W. Beecham ◽

Sandra Barral ◽

Dolly Reyes-Dumeyer ◽

...

Keyword(s):

Alzheimer's Disease ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Rare Variants ◽

Late Onset ◽

Whole Genome