scholarly journals How Array Design affects SNP Ascertainment Bias

2019 ◽  
Author(s):  
Johannes Geibel ◽  
Christian Reimer ◽  
Steffen Weigend ◽  
Annett Weigend ◽  
Torsten Pook ◽  
...  

AbstractSingle nucleotide polymorphisms (SNPs), genotyped with SNP arrays, have become a widely used marker type in population genetic analyses over the last 10 years. However, compared to whole genome re-sequencing data, arrays are known to lack a substantial proportion of globally rare variants and tend to be biased towards variants present in populations involved in the development process of the respective array. This affects population genetic estimators and is known as SNP ascertainment bias. We investigated factors contributing to ascertainment bias in array development by redesigning the Axiom™ Genome-Wide Chicken Array in silico and evaluating changes in allele frequency spectra and heterozygosity estimates in a stepwise manner. A sequential reduction of rare alleles during the development process was shown with main influencing factors being the identification of SNPs in a limited set of populations and a within-population selection of common SNPs when aiming for equidistant spacing. These effects were shown to be less severe with a larger discovery panel. Additionally, a generally massive overestimation of expected heterozygosity for the ascertained SNP sets was shown. This overestimation was 24% higher for populations involved in the discovery process than not involved populations in case of the original array. The same was observed after the SNP discovery step in the redesign. However, an unequal contribution of populations during the SNP selection can mask this effect but also adds uncertainty. Finally, we make suggestions for the design of specialized arrays for large scale projects where whole genome re-sequencing techniques are still too expensive.


PLoS ONE ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. e0245178
Author(s):  
Johannes Geibel ◽  
Christian Reimer ◽  
Steffen Weigend ◽  
Annett Weigend ◽  
Torsten Pook ◽  
...  

Single nucleotide polymorphisms (SNPs), genotyped with arrays, have become a widely used marker type in population genetic analyses over the last 10 years. However, compared to whole genome re-sequencing data, arrays are known to lack a substantial proportion of globally rare variants and tend to be biased towards variants present in populations involved in the development process of the respective array. This affects population genetic estimators and is known as SNP ascertainment bias. We investigated factors contributing to ascertainment bias in array development by redesigning the Axiom™ Genome-Wide Chicken Array in silico and evaluating changes in allele frequency spectra and heterozygosity estimates in a stepwise manner. A sequential reduction of rare alleles during the development process was shown. This was mainly caused by the identification of SNPs in a limited set of populations and a within-population selection of common SNPs when aiming for equidistant spacing. These effects were shown to be less severe with a larger discovery panel. Additionally, a generally massive overestimation of expected heterozygosity for the ascertained SNP sets was shown. This overestimation was 24% higher for populations involved in the discovery process than not involved populations in case of the original array. The same was observed after the SNP discovery step in the redesign. However, an unequal contribution of populations during the SNP selection can mask this effect but also adds uncertainty. Finally, we make suggestions for the design of specialized arrays for large scale projects where whole genome re-sequencing techniques are still too expensive.



Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 258
Author(s):  
Karim Karimi ◽  
Duy Ngoc Do ◽  
Mehdi Sargolzaei ◽  
Younes Miar

Characterizing the genetic structure and population history can facilitate the development of genomic breeding strategies for the American mink. In this study, we used the whole genome sequences of 100 mink from the Canadian Centre for Fur Animal Research (CCFAR) at the Dalhousie Faculty of Agriculture (Truro, NS, Canada) and Millbank Fur Farm (Rockwood, ON, Canada) to investigate their population structure, genetic diversity and linkage disequilibrium (LD) patterns. Analysis of molecular variance (AMOVA) indicated that the variation among color-types was significant (p < 0.001) and accounted for 18% of the total variation. The admixture analysis revealed that assuming three ancestral populations (K = 3) provided the lowest cross-validation error (0.49). The effective population size (Ne) at five generations ago was estimated to be 99 and 50 for CCFAR and Millbank Fur Farm, respectively. The LD patterns revealed that the average r2 reduced to <0.2 at genomic distances of >20 kb and >100 kb in CCFAR and Millbank Fur Farm suggesting that the density of 120,000 and 24,000 single nucleotide polymorphisms (SNP) would provide the adequate accuracy of genomic evaluation in these populations, respectively. These results indicated that accounting for admixture is critical for designing the SNP panels for genotype-phenotype association studies of American mink.



2010 ◽  
Vol 26 (17) ◽  
pp. 2101-2108 ◽  
Author(s):  
Jiří Macas ◽  
Pavel Neumann ◽  
Petr Novák ◽  
Jiming Jiang

Abstract Motivation: Satellite DNA makes up significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free, approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.





2020 ◽  
Vol 31 (2) ◽  
pp. 365-373 ◽  
Author(s):  
Adam P. Levine ◽  
Melanie M.Y. Chan ◽  
Omid Sadeghi-Alavijeh ◽  
Edwin K.S. Wong ◽  
H. Terence Cook ◽  
...  

BackgroundPrimary membranoproliferative GN, including complement 3 (C3) glomerulopathy, is a rare, untreatable kidney disease characterized by glomerular complement deposition. Complement gene mutations can cause familial C3 glomerulopathy, and studies have reported rare variants in complement genes in nonfamilial primary membranoproliferative GN.MethodsWe analyzed whole-genome sequence data from 165 primary membranoproliferative GN cases and 10,250 individuals without the condition (controls) as part of the National Institutes of Health Research BioResource–Rare Diseases Study. We examined copy number, rare, and common variants.ResultsOur analysis included 146 primary membranoproliferative GN cases and 6442 controls who were unrelated and of European ancestry. We observed no significant enrichment of rare variants in candidate genes (genes encoding components of the complement alternative pathway and other genes associated with the related disease atypical hemolytic uremic syndrome; 6.8% in cases versus 5.9% in controls) or exome-wide. However, a significant common variant locus was identified at 6p21.32 (rs35406322) (P=3.29×10−8; odds ratio [OR], 1.93; 95% confidence interval [95% CI], 1.53 to 2.44), overlapping the HLA locus. Imputation of HLA types mapped this signal to a haplotype incorporating DQA1*05:01, DQB1*02:01, and DRB1*03:01 (P=1.21×10−8; OR, 2.19; 95% CI, 1.66 to 2.89). This finding was replicated by analysis of HLA serotypes in 338 individuals with membranoproliferative GN and 15,614 individuals with nonimmune renal failure.ConclusionsWe found that HLA type, but not rare complement gene variation, is associated with primary membranoproliferative GN. These findings challenge the paradigm of complement gene mutations typically causing primary membranoproliferative GN and implicate an underlying autoimmune mechanism in most cases.



2021 ◽  
Author(s):  
KE Joyce ◽  
E Onabanjo ◽  
S Brownlow ◽  
F Nur ◽  
KO Olupona ◽  
...  

ABSTRACTPossession of a clinical or molecular disease label alters the context in which life-course events operate, but rarely explains the phenotypic variability observed by clinicians. Whole genome sequencing of unselected endothelial vasculopathy patients demonstrated more than a third had rare, likely deleterious variants in clinically-relevant genes unrelated to their vasculopathy (1 in 10 within platelet genes; 1 in 8 within coagulation genes; and 1 in 4 within erythrocyte hemolytic genes). High erythrocyte membrane variant rates paralleled genomic damage and prevalence indices in the general population. In blinded analyses, patients with greater hemorrhagic severity that had been attributed solely to their vasculopathy had more deleterious variants in platelet (Spearman ρ=0.25, p=0.008) and coagulation (Spearman ρ=0.21, p=0.024) genes. We conclude that rare diseases can provide insights for medicine beyond their primary pathophysiology, and propose a framework based on rare variants to inform interpretative approaches to accelerate clinical impact from whole genome sequencing.



2021 ◽  
Author(s):  
Jiru Han ◽  
Jacob E Munro ◽  
Anthony Kocoski ◽  
Alyssa E Barry ◽  
Melanie Bahlo

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax published whole-genome sequence data from samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) were used to study Plasmodium genetic diversity, population structures and genomic signatures of selection and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax have been made available in an interactive web-based R Shiny application PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).



2019 ◽  
Vol 15 ◽  
pp. P1312-P1312
Author(s):  
Badri N. Vardarajan ◽  
James Jaworski ◽  
Gary W. Beecham ◽  
Sandra Barral ◽  
Dolly Reyes-Dumeyer ◽  
...  


Sign in / Sign up

Export Citation Format

Share Document