Scalable probabilistic PCA for large-scale genetic variation data

PLoS Genetics ◽  
2020 ◽  
Vol 16 (5) ◽  
pp. e1008773
Author(s):  
Aman Agrawal ◽  
Alec M. Chiu ◽  
Minh Le ◽  
Eran Halperin ◽  
Sriram Sankararaman
2019 ◽  
Author(s):  
Aman Agrawal ◽  
Alec M. Chiu ◽  
Minh Le ◽  
Eran Halperin ◽  
Sriram Sankararaman

Abstract Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we scanned for SNPs that are not well-explained by the PCs to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

Author Summary Principal component analysis is a commonly used technique for understanding population structure and genetic variation. With the advent of large-scale datasets that contain the genetic information of hundreds of thousands of individuals, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. In this study, we present ProPCA, a highly scalable statistical method to compute genetic PCs efficiently. We systematically evaluate the accuracy and robustness of our method on large-scale simulated data and apply it to the UK Biobank. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we identify several novel signals of putative recent selection.
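As an illustration of how a probabilistic formulation can yield top PCs without forming a full covariance matrix, the following is a minimal sketch of the standard EM iteration for PCA in the zero-noise limit of probabilistic PCA. It is not ProPCA's implementation: the function name `em_pca`, the fixed iteration count, and the use of a small dense NumPy matrix are assumptions made for the example, and the abstract above does not specify the algorithmic details.

```python
import numpy as np

def em_pca(Y, k, n_iter=200, seed=0):
    """Illustrative EM iteration for PCA (hypothetical helper, not ProPCA).

    Y is a (n_features, n_samples) matrix whose rows are already mean-centered.
    Returns an orthonormal (n_features, k) basis for the top-k principal subspace.
    """
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((Y.shape[0], k))  # random initial loadings
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # Orthonormalize so the columns form a basis of the estimated subspace
    Q, _ = np.linalg.qr(C)
    return Q
```

Each iteration only multiplies the data matrix by thin matrices with k columns or rows, which is the property that makes this family of methods attractive for large genotype matrices. The sketch recovers the principal subspace; rotating the basis into individual PCs would need an additional small eigendecomposition.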


2021 ◽  
Vol 53 (1) ◽  
Author(s):  
Martin Johnsson ◽  
Andrew Whalen ◽  
Roger Ros-Freixedes ◽  
Gregor Gorjanc ◽  
Ching-Yi Chen ◽  
...  

Abstract Background: Meiotic recombination results in the exchange of genetic material between homologous chromosomes. Recombination rate varies between different parts of the genome and between individuals, and is influenced by genetics. In this paper, we assessed the genetic variation in recombination rate along the genome and between individuals in the pig using multilocus iterative peeling on 150,000 individuals across nine genotyped pedigrees. We used these data to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig. Results: Our results confirmed known features of the recombination landscape of the pig genome, including differences in genetic length of chromosomes and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, there were differences in average autosome-wide recombination rate between lines. The heritability of autosome-wide recombination rate was low but not zero (on average 0.07 for females and 0.05 for males). We found six genomic regions that are associated with recombination rate, among which five harbour known candidate genes involved in recombination: RNF212, SHOC1, SYCP2, MSH4 and HFM1. Conclusions: Our results on the variation in recombination rate in the pig genome agree with those reported for other vertebrates, with a low but nonzero heritability, and the identification of a major quantitative trait locus for recombination rate that is homologous to that detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.


2009 ◽  
Vol 25 (5) ◽  
pp. 662-663 ◽  
Author(s):  
Olivier Martin ◽  
Armand Valsesia ◽  
Amalio Telenti ◽  
Ioannis Xenarios ◽  
Brian J. Stevenson

Forests ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 1185
Author(s):  
Helena Eklöf ◽  
Carolina Bernhardsson ◽  
Pär K. Ingvarsson

Conifer genomes are characterized by their large size and high abundance of repetitive material, making large-scale genotyping in conifers complicated and expensive. One consequence is that it has been difficult to generate data on genome-wide levels of genetic variation. To date, researchers have mainly employed various complexity reduction techniques to assess genetic variation across the genome in different conifer species. These methods tend to capture variation in a relatively small subset of a typical conifer genome, and it is currently not clear how representative such results are. Here we take advantage of data generated in the first large-scale re-sequencing effort in Norway spruce (Picea abies (L.) H. Karst.) and assess how well two commonly used complexity reduction methods, targeted capture probes and genotyping by sequencing (GBS), perform in capturing genome-wide variation in this species. Our results suggest that both methods perform reasonably well for assessing genetic diversity and population structure in Norway spruce. Targeted capture probes were slightly more effective than GBS, likely because they target known genomic regions, whereas the GBS data contain a substantially greater fraction of repetitive regions, which can sometimes be problematic for assessing genetic diversity. In conclusion, both methods are useful for genotyping large numbers of samples, and they greatly reduce the cost of genotyping a species with a genome as complex as that of Norway spruce.


1995 ◽  
Vol 25 (12) ◽  
pp. 1913-1927 ◽  
Author(s):  
N.C. Wheeler ◽  
K.S. Jech ◽  
S.A. Masters ◽  
C.J. O'Brien ◽  
R.W. Stonecypher ◽  
...  

Pacific yew (Taxus brevifolia Nutt.) is a shade-tolerant gymnosperm native to the western United States and Canada. It recently gained attention as the source of Taxol® (paclitaxel), a promising new anticancer drug. Large-scale harvest of mature Pacific yew trees for the extraction of paclitaxel has resulted in the need for improved forest management practices and an increased understanding of the amount and distribution of genetic variation in the species. We partitioned estimates of genetic variance for allozyme, metric, and taxane traits into region, population, family, and within family components in seedling common-garden tests. Genetic diversity, genetic distance, and Nei's Gst values were estimated based on gene frequencies for 22 isozyme loci. Concentrations of taxanes were determined for needles and roots using HPLC. Populations of Pacific yew are more distinct from one another than is typical of long-lived, wind-pollinated conifers in western North America, but there is little regional differentiation. Yew populations have notably less allozyme diversity than most other gymnosperms with similar life-history characteristics. Most genetic variation in all traits occurs within the population, and much of that is within family. Heritabilities for growth and taxane traits ranged from low to moderately high. Gene conservation or management strategies should include broad sampling among and within populations of Pacific yew. Opportunities for genetic selection to develop improved lines or cultivars for the production of paclitaxel exist, but use of currently domesticated yew species is more time and cost efficient.


Blood ◽  
2008 ◽  
Vol 112 (11) ◽  
pp. 1679-1679
Author(s):  
Sonja I Berndt ◽  
David C Johnson ◽  
John Crowley ◽  
Brian G Durie ◽  
Robert Hoover ◽  
...  

Abstract Genetic factors are thought to influence susceptibility to multiple myeloma, but most published studies to date have been small and limited in scope. To identify genetic polymorphisms associated with myeloma risk, we conducted a case-control study of 976 Caucasian myeloma cases enrolled from clinical trials as part of the International Myeloma Foundation’s Bank On A Cure® initiative and 3692 Caucasian controls from three cohorts [Nurses’ Health Study (NHS), Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), and 1958 British Birth Cohort (BC58)] with genome-wide scan data. A candidate gene approach was taken, with preference given to single nucleotide polymorphisms (SNPs) in coding or regulatory regions. A total of 1097 SNPs with a minor allele frequency ≥1% were genotyped in the cases and at least one control population. To increase statistical power, SNPs not genotyped in NHS and PLCO were imputed from the genome scan with MACH using the HapMap CEU population as a referent and included in the analysis if the quality control r² was high (r² ≥ 0.9). Logistic regression was used to estimate the odds ratios (ORs) and 95% confidence intervals (95% CIs), adjusting for age, sex, and country as appropriate. We found 26 loci to be associated with myeloma risk with P < 0.01. Of particular interest, we observed an increased risk of myeloma with variants in two genes involved in the metabolism of pyrimidines, DPYD and MTHFR. An increased risk of myeloma was found with two independent SNPs, rs1023244 and rs1399291, in DPYD (OR per G allele = 1.43, 95% CI: 1.16–1.76, P = 0.0008 and OR per T allele = 1.18, 95% CI: 1.06–1.31, P = 0.003, respectively) and with the MTHFR high activity 677C allele (rs1801133, OR per C allele = 1.18, 95% CI: 1.05–1.33, P = 0.006).
We also observed significant associations for nonsynonymous SNPs in genes involved in cell cycle checkpoint regulation (ATR, P = 0.009; ZAK, P = 0.007) and the DNA damage bypass pathway (REV3L, P = 0.008), suggesting that alterations in DNA damage mediation may modulate myeloma susceptibility. In conclusion, this large study found SNPs in several pathways, including pyrimidine metabolism and DNA damage mediation, to be associated with myeloma risk. Additional studies are needed to replicate these findings and to further explore genetic variation in these regions.
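The per-allele odds ratios and confidence intervals in this abstract come from logistic regression, where OR = exp(β) and a 95% Wald interval is exp(β ± 1.96·SE). The sketch below back-calculates an approximate standard error and two-sided P-value from a reported OR and CI. The function name `wald_from_or_ci` is hypothetical, and the calculation assumes a Wald interval with a normal approximation, which the abstract does not state explicitly.

```python
import math

def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover log-odds, standard error, z-score and two-sided P-value
    from a reported odds ratio and its 95% Wald confidence interval.
    Hypothetical helper for illustration only."""
    beta = math.log(odds_ratio)                       # log-odds per allele
    # The CI is symmetric on the log scale: width = 2 * z_crit * SE
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se                                     # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal P-value
    return beta, se, z, p

# Reported values for rs1023244 in DPYD: OR 1.43, 95% CI 1.16-1.76
beta, se, z, p = wald_from_or_ci(1.43, 1.16, 1.76)
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

For rs1023244 the recovered P-value is of the same order as the reported P = 0.0008. Small discrepancies are expected because the published OR and CI are rounded to two decimals.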


2011 ◽  
Vol 69 (4) ◽  
pp. 353-359 ◽  
Author(s):  
Kwang H. Choi ◽  
Brandon W. Higgs ◽  
Jens R. Wendland ◽  
Jonathan Song ◽  
Francis J. McMahon ◽  
...  
