A scalable estimator of SNP heritability for Biobank-scale data

2018 ◽  
Author(s):  
Yue Wu ◽  
Sriram Sankararaman

Abstract Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear Mixed Models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e., the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments (MoM) estimator that has a runtime complexity of O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)). We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500,000 individuals and 100,000 SNPs in 38 minutes. Availability: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/[email protected]
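The idea in the abstract above, a method-of-moments estimator driven by B random matrix-vector products, can be illustrated with a short sketch. This is not the RHE-reg implementation: the function names, the use of a Hutchinson-style trace estimator with Gaussian probe vectors, and the simulation helper are assumptions made for illustration, and the genotype-structure speedup mentioned in the abstract is not reproduced here. The sketch also assumes every SNP column is polymorphic, so standardization never divides by zero.

```python
import numpy as np


def simulate_phenotype(n, m, h2, seed=0):
    """Hypothetical helper: simulate genotypes and a phenotype with SNP heritability h2."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)                # allele frequencies
    G = rng.binomial(2, freqs, size=(n, m)).astype(float)
    Xs = (G - G.mean(axis=0)) / G.std(axis=0)            # standardized genotypes
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)      # per-SNP effects
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return G, Xs @ beta + noise


def randomized_mom_heritability(G, y, B=10, seed=0):
    """Method-of-moments estimate of h2 with a randomized estimate of tr(K^2).

    K = X X^T / M is never formed; only products with X and X^T are used,
    so the cost is O(N * M * B).
    """
    rng = np.random.default_rng(seed)
    n, m = G.shape
    X = (G - G.mean(axis=0)) / G.std(axis=0)             # assumes no monomorphic SNPs
    y = y - y.mean()

    # Randomized trace: tr(K^2) is approximated by the mean of ||K z_b||^2 over B probes.
    Z = rng.standard_normal((n, B))
    KZ = X @ (X.T @ Z) / m
    tr_K2 = np.sum(KZ ** 2) / B
    tr_K = float(n)                                      # exact for standardized columns

    Xty = X.T @ y
    yKy = Xty @ Xty / m
    yy = y @ y

    # Moment equations: [[tr(K^2), tr(K)], [tr(K), N]] [sg2, se2]^T = [y'Ky, y'y]^T
    A = np.array([[tr_K2, tr_K], [tr_K, float(n)]])
    b = np.array([yKy, yy])
    sg2, se2 = np.linalg.solve(A, b)
    return sg2 / (sg2 + se2)


if __name__ == "__main__":
    G, y = simulate_phenotype(n=2000, m=500, h2=0.5, seed=1)
    print("estimated h2:", randomized_mom_heritability(G, y, B=20, seed=2))
```

Increasing B reduces the variance of the trace estimate at a proportional increase in runtime, which is the trade-off the parameter B controls in the abstract.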

2015 ◽  
Author(s):  
Dominic Holland ◽  
Yunpeng Wang ◽  
Wesley K Thompson ◽  
Andrew Schork ◽  
Chi-Hua Chen ◽  
...  

Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities that does not require raw genotype data, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype -- the proportion of SNPs (after uniform pruning, so that large LD blocks are not over-represented) likely to be in strong LD with causal/mechanistically associated SNPs -- and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N=82,315) and additionally, for purposes of illustration, putamen volume (N=12,596), with approximately 9.3 million SNP z-scores in both cases.
We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We estimate the degree to which effect sizes are over-estimated when based on linear regression association coefficients. We estimate the polygenicity of schizophrenia to be 0.037 and that of the putamen to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
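To make the Gaussian-mixture idea concrete, the sketch below uses a deliberately simplified two-component mixture rather than the four-parameter model the abstract refers to. It assumes a z-score is drawn from N(0, 1) for null SNPs and from N(0, 1 + sigma2) for non-null SNPs, with pi1 the non-null proportion; the function names, `pi1`, and `sigma2` are hypothetical and the parameter values in the demo are illustrative only. Under these assumptions the posterior expected effect is always smaller in magnitude than the observed z-score, which is one way to see why raw association estimates overstate true effects.

```python
import math


def normal_pdf(z, var):
    """Density of a zero-mean normal with variance `var` evaluated at z."""
    return math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)


def posterior_nonnull(z, pi1, sigma2):
    """Posterior probability that a SNP with z-score z comes from the non-null component."""
    nonnull = pi1 * normal_pdf(z, 1.0 + sigma2)
    null = (1.0 - pi1) * normal_pdf(z, 1.0)
    return nonnull / (nonnull + null)


def expected_true_effect(z, pi1, sigma2):
    """Posterior mean of the true (noise-free) effect given the observed z-score.

    Within the non-null component the effect is shrunk by sigma2 / (1 + sigma2);
    the null component contributes zero.
    """
    shrinkage = sigma2 / (1.0 + sigma2)
    return posterior_nonnull(z, pi1, sigma2) * shrinkage * z


if __name__ == "__main__":
    # Illustrative values only; they are not fitted to any dataset.
    for z in (0.0, 2.0, 5.0):
        print(z, posterior_nonnull(z, 0.05, 9.0), expected_true_effect(z, 0.05, 9.0))
```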


Author(s):  
Denis Awany ◽  
Emile R Chimusa

Abstract As we observe the 70th anniversary of the publication by Robertson that formalized the notion of ‘heritability’, geneticists remain puzzled by the problem of missing/hidden heritability, where heritability estimates from genome-wide association studies (GWASs) fall short of those from twin-based studies. Many possible explanations have been offered for this discrepancy, including existence of genetic variants poorly captured by existing arrays, dominance, epistasis and unaccounted-for environmental factors; albeit these remain controversial. We believe a substantial part of this problem could be solved or better understood by incorporating the host’s microbiota information in the GWAS model for heritability estimation, which may also improve human trait prediction for clinical utility. This is because, despite empirical observations such as (i) the intimate role of the microbiome in many complex human phenotypes, (ii) the overlap between genetic variants associated with both microbiome attributes and complex diseases and (iii) the existence of heritable bacterial taxa, current GWAS models for heritability estimation do not take into account the contributory role of the microbiome. Furthermore, heritability estimates from twin-based studies do not discern the microbiome component of the observed total phenotypic variance. Here, we summarize the concept of heritability in GWAS and microbiome-wide association studies, focusing on its estimation, from a statistical genetics perspective. We then discuss a possible statistical method to incorporate the microbiome in the estimation of heritability in host GWAS.


2019 ◽  
Vol 47 (14) ◽  
pp. e79-e79
Author(s):  
Aitor González ◽  
Marie Artufel ◽  
Pascal Rihet

Abstract Genome-wide association studies (GWAS) associate single nucleotide polymorphisms (SNPs) with complex phenotypes. Most human SNPs fall in non-coding regions and are likely regulatory SNPs, but linkage disequilibrium (LD) blocks make it difficult to distinguish functional SNPs. Therefore, putative functional SNPs are usually annotated with molecular markers of gene regulatory regions and prioritized with dedicated prediction tools. We integrated associated SNPs, LD blocks and regulatory features into a supervised model called TAGOOS (TAG SNP bOOSting) and computed scores genome-wide. The TAGOOS scores enriched and prioritized unseen associated SNPs with an odds ratio of 4.3 and 3.5 and an area under the curve (AUC) of 0.65 and 0.6 for intronic and intergenic regions, respectively. The TAGOOS score was correlated with the maximal significance of associated SNPs and expression quantitative trait loci (eQTLs) and with the number of biological samples annotated for key regulatory features. Analysis of loci and regions associated with cleft lip and human adult height phenotypes recovered known functional loci and predicted new functional loci enriched in transcription factors related to the phenotypes. In conclusion, we trained a supervised model based on associated SNPs to prioritize putative functional regions. The TAGOOS scores, annotations and UCSC genome tracks are available here: https://tagoos.readthedocs.io.


2017 ◽  
Author(s):  
Carlo Maj ◽  
Elena Milanesi ◽  
Massimo Gennarelli ◽  
Luciano Milanesi ◽  
Ivan Merelli

In complex phenotypes (e.g., psychiatric diseases), single-locus tests, commonly performed in Genome-Wide Association Studies, have proven to be limited in discovering strong gene associations. A growing body of evidence suggests that epistatic non-linear effects may be responsible for complex phenotypes arising from the interaction of different biological factors. A major issue in epistasis analysis is the computational burden due to the huge number of statistical tests to be performed when considering all the potential genotype combinations. In this work, we developed a computationally efficient pipeline to investigate the presence of epistasis at a genome-wide scale in bipolar disorder, which is a typical example of a complex phenotype with a relevant but unexplained genetic background. By running our pipeline we were able to identify 13 epistatic interactions between variants located in genes potentially involved in biological processes associated with the analyzed phenotype.


2019 ◽  
Vol 40 (01) ◽  
pp. 012-018
Author(s):  
Paula Tejera ◽  
David Christiani

Abstract Genome-wide association studies (GWASs) in acute respiratory distress syndrome (ARDS) have been hampered by the heterogeneity of the clinical phenotypes and the large sample size requirement. As the limitations of these studies to uncover the complex genetic architecture of ARDS are evident, new approaches intended to reduce data complexity need to be applied. Intermediate phenotypes are mechanism-related manifestations of the disease, located closer to the genetic substrate than to the disease phenotype, and therefore able to reflect more directly and more strongly the effect of causal genes. The dissection of complex phenotypes into less complex intermediate phenotypes is a valuable strategy to facilitate the discovery of those genetic variants whose effect is not strong enough to be detected as markers of disease in traditional GWASs. Genetic causal inference methodologies can then be applied to estimate the implication of the intermediate trait in the causal circuit between genes and disease. By following this strategy, platelet count, a relevant intermediate quantitative trait in ARDS, has been recently identified as a novel mediator in the genetic contribution to ARDS risk and mortality. The use of intermediate phenotypes and causal inference are emerging methodological and statistical strategies that can help to overcome the limitations of traditional GWASs in ARDS. Moreover, these approaches can provide evidence for the mechanisms linking genes to ARDS and help to prioritize therapeutic targets for the treatment of this devastating syndrome.


2019 ◽  
Author(s):  
M. Pérez-Enciso ◽  
L. C. Ramírez-Ayala ◽  
L.M. Zingaretti

Abstract Background: Genomic Prediction (GP) is the procedure whereby molecular information is used to predict complex phenotypes. Although GP can significantly enhance predictive accuracy, it can be expensive and difficult to implement. To help in designing optimum experiments, including genome-wide association studies and genomic selection experiments, we have developed SeqBreed, a generic and flexible Python 3 forward simulator. Results: SeqBreed accommodates sex and mitochondrion chromosomes as well as autopolyploidy. It can simulate any number of complex phenotypes determined by any number of causal loci. SeqBreed implements several GP methods, including single-step GBLUP. We demonstrate its functionality with Drosophila Genome Reference Panel (DGRP) sequence data and with tetraploid potato genotypes. Conclusions: SeqBreed is a flexible and easy-to-use tool appropriate for optimizing GP or genome-wide association studies. It incorporates some of the most popular GP methods and includes several visualization tools. Code is open and can be freely modified. Software, documentation and examples are available at https://github.com/miguelperezenciso/SeqBreed.


2020 ◽  
Author(s):  
Denis Awany ◽  
Emile R. Chimusa

Abstract As we observe the 70th anniversary of the publication by Robertson that formalized the notion of ‘heritability’, geneticists remain puzzled by the problem of missing/hidden heritability, where heritability estimates from genome-wide association studies (GWAS) fall short of those from twin-based studies. Many possible explanations have been offered for this discrepancy, including existence of genetic variants poorly captured by existing arrays, dominance, epistasis, and unaccounted-for environmental factors; albeit these remain controversial. We believe a substantial part of this problem could be solved or better understood by incorporating the host’s microbiota information in the GWAS model for heritability estimation, ultimately also improving human trait prediction for clinical utility. This is because, despite empirical observations such as (i) the intimate role of the microbiome in many complex human phenotypes, (ii) the overlap between genetic variants associated with both microbiome attributes and complex diseases, and (iii) the existence of heritable bacterial taxa, current GWAS models for heritability estimation do not take into account the contributory role of the microbiome. Furthermore, heritability estimates from twin-based studies do not discern the microbiome component of the observed total phenotypic variance. Here, we summarize the concept of heritability in GWAS and microbiome-wide association studies (MWAS), focusing on its estimation, from a statistical genetics perspective. We then discuss a possible method to incorporate the microbiome in the estimation of heritability in host GWAS.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (1) ◽  
pp. e1009241
Author(s):  
Alejandro Ochoa ◽  
John D. Storey

FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of FST to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.


2020 ◽  
Author(s):  
Minjun Huang ◽  
Britney Graham ◽  
Ge Zhang ◽  
Jacquelaine Bartlett ◽  
Jason H. Moore ◽  
...  

Abstract Recent advances in genetics have increased our understanding of epistasis as important in the genetics of complex phenotypes. However, current analytical methods often cannot detect epistasis, given the multiple testing burden. To address this, we extended our previous method, Evolutionary Triangulation (ET), which uses differences among populations in both disease prevalence and allele frequencies to filter SNPs from association studies, to generate novel interaction models. We show that two-locus ET identified several co-evolving gene pairs, where both genes associate with the same disease, and that the number of such pairs is significantly greater than expected by chance. Traits found by two-locus ET included those related to pigmentation and schizophrenia. We then applied two-locus ET to the analysis of preterm birth (PTB) genetics. Using ET to filter SNPs at loci identified by genome-wide association studies (GWAS), we showed that ET-derived PTB two-locus models are novel and were not seen when only the index SNPs were used to generate epistatic models. One gene pair, ADCY5 and KCNAB1 5’, was identified as significantly interacting in a model of gestational age (p as low as 3 × 10^-3). Notably, the same ET SNPs in these genes showed significant interactions in three of four cohorts analyzed. The robustness of this gene pair and others demonstrated that the ET method can be used without prior biological hypotheses based on SNP function to select variants for epistasis testing that could not be identified otherwise. Two-locus ET clearly increased the ability to identify epistasis in complex traits.


2012 ◽  
Vol 18 (5) ◽  
pp. 846-850 ◽  
Author(s):  
Karin J. H. Verweij ◽  
Anna A. E. Vinkhuyzen ◽  
Beben Benyamin ◽  
Michael T. Lynskey ◽  
Lydia Quaye ◽  
...  
