Estimating FST and kinship for arbitrary population structures

FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of FST to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.

FST and kinship for arbitrary population structures II: Method-of-moments estimators

10.1101/083923 ◽ 2016 ◽ Cited By ~ 10

Author(s): Alejandro Ochoa ◽ John D. Storey

Keyword(s): Method Of Moments ◽ Association Studies ◽ Data Sets ◽ Genome Wide Association Studies ◽ Genome Wide ◽ Heritability Estimation ◽ Complex Population ◽ Population Structures ◽ Moments Estimators ◽ Modern Population

Abstract: FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we provide new results on the behavior of these estimators in the presence of arbitrarily complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.

Heritability jointly explained by host genotype and microbiome: will improve traits prediction?

Briefings in Bioinformatics ◽ 10.1093/bib/bbaa175 ◽ 2020

Author(s): Denis Awany ◽ Emile R Chimusa

Keyword(s): Genetic Variants ◽ Association Studies ◽ Heritability Estimate ◽ Substantial Part ◽ Phenotypic Variance ◽ Genome Wide Association Studies ◽ Host Genotype ◽ Genome Wide ◽ Heritability Estimation

Abstract: As we observe the 70th anniversary of the publication by Robertson that formalized the notion of ‘heritability’, geneticists remain puzzled by the problem of missing/hidden heritability, where heritability estimates from genome-wide association studies (GWASs) fall short of those from twin-based studies. Many possible explanations have been offered for this discrepancy, including the existence of genetic variants poorly captured by existing arrays, dominance, epistasis and unaccounted-for environmental factors, although these remain controversial. We believe a substantial part of this problem could be solved or better understood by incorporating the host’s microbiota information in the GWAS model for heritability estimation, which may also improve human trait prediction for clinical utility. This is because, despite empirical observations such as (i) the intimate role of the microbiome in many complex human phenotypes, (ii) the overlap between genetic variants associated with both microbiome attributes and complex diseases and (iii) the existence of heritable bacterial taxa, current GWAS models for heritability estimation do not take into account the contributory role of the microbiome. Furthermore, heritability estimates from twin-based studies do not discern the microbiome component of the observed total phenotypic variance. Here, we summarize the concept of heritability in GWAS and microbiome-wide association studies, focusing on its estimation, from a statistical genetics perspective. We then discuss a possible statistical method to incorporate the microbiome in the estimation of heritability in host GWAS.

Comment on ‘Large-Scale Cognitive GWAS Meta-Analysis Reveals Tissue-Specific Neural Expression and Potential Nootropic Drug Targets’ by Lam et al.

Twin Research and Human Genetics ◽ 10.1017/thg.2018.12 ◽ 2018 ◽ Vol 21 (2) ◽ pp. 84-88 ◽ Cited By ~ 6

Author(s): W. David Hill

Keyword(s): Genetic Information ◽ Drug Targets ◽ Large Scale ◽ Association Studies ◽ Meta Analysis ◽ Genetic Correlations ◽ Data Sets ◽ Genome Wide Association Studies ◽ Nootropic Drug ◽ Genome Wide

Intelligence and educational attainment are strongly genetically correlated. This relationship can be exploited by Multi-Trait Analysis of GWAS (MTAG) to add power to genome-wide association studies (GWAS) of intelligence. MTAG allows the user to meta-analyze GWASs of different phenotypes, based on their genetic correlations, to identify associations specific to the trait of choice. An MTAG analysis using GWAS data sets on intelligence and education was conducted by Lam et al. (2017), who reported 70 loci that they described as ‘trait specific’ to intelligence. This article examines whether the analysis conducted by Lam et al. (2017) has resulted in genetic information about a phenotype that is more similar to education than intelligence.

A scalable estimator of SNP heritability for Biobank-scale data

10.1101/294470 ◽ 2018

Author(s): Yue Wu ◽ Sriram Sankararaman

Keyword(s): Variance Components ◽ Association Studies ◽ Randomized Algorithm ◽ Genome Wide Association Studies ◽ Complex Phenotypes ◽ Genome Wide ◽ Heritability Estimation ◽ Estimate Heritability ◽ Matrix Vector ◽ Scale Data

Abstract

Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear Mixed Models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e., the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens.

Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments (MoM) estimator whose runtime complexity depends on the number of individuals N, the number of SNPs M, and a parameter B that controls the number of random matrix-vector multiplications. By leveraging the structure of the genotype matrix, we can reduce the time complexity further. We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500,000 individuals and 100,000 SNPs in 38 minutes.

Availability: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/[email protected]
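
The idea of a method-of-moments estimator driven by random matrix-vector multiplications can be illustrated with a small sketch. This is not the RHE-reg implementation; it assumes a standardized genotype matrix `X` (N x M), a centered phenotype `y`, no covariates, and Haseman-Elston-style moment equations, and it uses a Hutchinson-type estimate of tr(K^2) with `B` random sign vectors. The function names `matvec_K` and `mom_heritability` are hypothetical.

```python
import random


def matvec_K(X, v):
    """Return K v for K = X X^T / M without ever forming the N x N matrix K."""
    n, m = len(X), len(X[0])
    # t = X^T v (length M)
    t = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]
    # K v = X t / M (length N)
    return [sum(X[i][j] * t[j] for j in range(m)) / m for i in range(n)]


def mom_heritability(X, y, B=10, seed=0):
    """Moment-based heritability estimate (illustrative assumption, see lead-in)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])

    # Randomized estimate of tr(K^2): average of ||K z||^2 over B sign vectors z.
    tr_k2 = 0.0
    for _ in range(B):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        kz = matvec_K(X, z)
        tr_k2 += sum(a * a for a in kz)
    tr_k2 /= B

    # tr(K) is exact: sum of squared genotype entries divided by M.
    tr_k = sum(x * x for row in X for x in row) / m

    ky = matvec_K(X, y)
    yky = sum(a * b for a, b in zip(y, ky))
    yy = sum(a * a for a in y)

    # Solve the 2 x 2 moment equations
    #   tr(K^2) * sg + tr(K) * se = y^T K y
    #   tr(K)   * sg + N     * se = y^T y
    det = tr_k2 * n - tr_k * tr_k
    sg = (yky * n - tr_k * yy) / det
    se = (tr_k2 * yy - tr_k * yky) / det
    return sg / (sg + se)
```

Each random vector costs two passes over `X`, so the work in this sketch grows with the number of individuals, the number of SNPs and `B`; the estimate is not constrained to lie between 0 and 1.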

The exhaustive genomic scan approach, with an application to rare-variant association analysis

10.1101/571752 ◽ 2019

Author(s): George Kanoungi ◽ Michael Nothnagel ◽ Tim Becker ◽ Dmitriy Drichel

Keyword(s): Rare Variant ◽ Association Studies ◽ A Priori ◽ Error Rates ◽ Age Related Macular Degeneration ◽ Data Sets ◽ Rare Variant Association ◽ Age Related ◽ Genome Wide ◽ Definition Of

Abstract: Region-based genome-wide scans are usually performed using analysis regions chosen a priori. Such an approach will likely miss the region comprising the strongest signal and, thus, may result in increased type II error rates and decreased power. Here, we propose a genomic exhaustive scan approach that analyzes all possible subsequences and does not rely on a prior definition of the analysis regions. As a prime instance, we present a computationally ultra-efficient implementation using the rare-variant collapsing test for phenotypic association, the genomic exhaustive collapsing scan (GECS). Our implementation allows for the identification of regions comprising the strongest signals in large, genome-wide rare-variant association studies while controlling the family-wise error rate via permutation. Application of GECS to two genomic data sets revealed several novel significantly associated regions for age-related macular degeneration and for schizophrenia. Our approach also offers a high potential for genome-wide scans for selection, methylation and other analyses.
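
A naive version of an exhaustive collapsing scan can be sketched as follows. This is only an illustration of the idea, not the GECS implementation and without its efficiency: it assumes a binary carrier matrix `G` (individuals x variants), a case/control label per individual, a plain 2 x 2 chi-square statistic as the collapsing test, and a max-statistic permutation p-value. All function names are hypothetical.

```python
import random


def collapsing_stat(carriers, is_case):
    """Chi-square statistic (no continuity correction) for carrier vs. case status."""
    n = len(is_case)
    a = sum(1 for c, k in zip(carriers, is_case) if c and k)          # carrier cases
    b = sum(1 for c, k in zip(carriers, is_case) if c and not k)      # carrier controls
    c_ = sum(1 for c, k in zip(carriers, is_case) if not c and k)     # non-carrier cases
    d = n - a - b - c_                                                # non-carrier controls
    denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c_) ** 2 / denom


def best_window(G, is_case):
    """Scan every contiguous window of variants; return (statistic, start, end)."""
    n, m = len(G), len(G[0])
    best = (0.0, 0, 0)
    for start in range(m):
        carriers = [False] * n
        for end in range(start, m):
            # Extend the window by one variant and update carrier flags.
            carriers = [c or bool(G[i][end]) for i, c in enumerate(carriers)]
            s = collapsing_stat(carriers, is_case)
            if s > best[0]:
                best = (s, start, end)
    return best


def permutation_pvalue(G, is_case, n_perm, rng):
    """P-value of the best window, comparing against the max statistic per permutation."""
    observed = best_window(G, is_case)[0]
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if best_window(G, labels)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because each permutation is compared on its own best window, the resulting p-value accounts for all windows tested, which is one common way to control the family-wise error rate by permutation.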

Privacy-Preserving Data Sharing for Genome-Wide Association Studies

Journal of Privacy and Confidentiality ◽ 10.29012/jpc.v5i1.629 ◽ 2013 ◽ Vol 5 (1) ◽ Cited By ~ 24

Author(s): Caroline Uhler ◽ Aleksandra B. Slavkovic ◽ Stephen E. Fienberg

Keyword(s): Differential Privacy ◽ Association Studies ◽ Simulated Data ◽ Privacy Preserving ◽ External Information ◽ Genome Wide Association Studies ◽ Chi Square ◽ Gwas Study ◽ Genome Wide ◽ Definition Of

Traditional statistical methods for confidentiality protection of statistical databases do not scale well to GWAS databases, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, we propose new methods to release aggregate GWAS data without compromising an individual’s privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics and p-values. We compare these approaches on simulated data and on a GWAS of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially private approach to penalized logistic regression.
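
As a minimal illustration of what releasing a differentially private allele frequency can look like, the sketch below applies the standard Laplace mechanism to a single SNP. It is not necessarily the authors' method; it assumes diploid genotypes coded 0/1/2, a neighboring-dataset definition in which one individual's genotype is replaced, and a privacy budget `epsilon` spent on this one release. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_allele_frequency(genotypes, epsilon, rng):
    """Release one allele frequency with epsilon-differential privacy (sketch)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    n = len(genotypes)
    freq = sum(genotypes) / (2.0 * n)
    # Replacing one individual changes the allele count by at most 2 out of 2n,
    # so the frequency changes by at most 1/n.
    sensitivity = 1.0 / n
    noisy = freq + laplace_noise(sensitivity / epsilon, rng)
    # Clamping to [0, 1] is post-processing and does not weaken the guarantee.
    return min(1.0, max(0.0, noisy))
```

Smaller values of `epsilon` add more noise, which reflects the trade-off between privacy and data utility mentioned in the abstract; releasing many SNPs would require splitting the budget across them.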

emeraLD: Rapid Linkage Disequilibrium Estimation with Massive Data Sets

10.1101/301366 ◽ 2018 ◽ Cited By ~ 1

Author(s): Corbin Quick ◽ Christian Fuchsberger ◽ Daniel Taliun ◽ Gonçalo Abecasis ◽ Michael Boehnke ◽ ...

Keyword(s): Linkage Disequilibrium ◽ Association Studies ◽ Random Access ◽ Supplementary Information ◽ Data Sets ◽ Genome Wide Association Studies ◽ Summary Statistics ◽ Genome Wide ◽ Wide Range ◽ Supplementary Material

Abstract

Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools.

Availability and Implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/[email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

TTC7B Emerges as a Novel Risk Factor for Ischemic Stroke Through the Convergence of Several Genome-Wide Approaches

Journal of Cerebral Blood Flow & Metabolism ◽ 10.1038/jcbfm.2012.24 ◽ 2012 ◽ Vol 32 (6) ◽ pp. 1061-1072 ◽ Cited By ~ 39

Author(s): Tiago Krug ◽ João Paulo Gabriel ◽ Ricardo Taipa ◽ Benedita V Fonseca ◽ Sophie Domingues-Montanari ◽ ...

Keyword(s): Ischemic Stroke ◽ Mononuclear Cells ◽ Association Studies ◽ Differentially Expressed ◽ Data Sets ◽ Genome Wide Association Studies ◽ Peripheral Blood Mononuclear ◽ Pathogenic Potential ◽ Novel Approach ◽ Genome Wide

We hereby propose a novel approach to the identification of ischemic stroke (IS) susceptibility genes that involves converging data from several unbiased genetic and genomic tools. We tested the association between IS and genes differentially expressed between cases and controls, then determined which data mapped to previously reported linkage peaks and were nominally associated with stroke in published genome-wide association studies. We first performed gene expression profiling in peripheral blood mononuclear cells of 20 IS cases and 20 controls. Sixteen differentially expressed genes mapped to reported whole-genome linkage peaks, including the TTC7B gene, which has been associated with major cardiovascular disease. At the TTC7B locus, 46 tagging polymorphisms were tested for association in 565 Portuguese IS cases and 520 controls. Markers nominally associated in at least one test and defining associated haplotypes were then examined in 570 IS Spanish cases and 390 controls. Several polymorphisms and haplotypes in the intron 5–intron 6 region of TTC7B were also associated with IS risk in the Spanish and combined data sets. Multiple independent lines of evidence therefore support the role of TTC7B in stroke susceptibility, but further work is warranted to identify the exact risk variant and its pathogenic potential.

2018 George Lyman Duff Memorial Lecture

Arteriosclerosis Thrombosis and Vascular Biology ◽ 10.1161/atvbaha.119.311392 ◽ 2019 ◽ Vol 39 (10) ◽ pp. 1925-1937 ◽ Cited By ~ 2

Author(s): Ruth McPherson

Keyword(s): Coronary Artery Disease ◽ Coronary Artery ◽ Disease Risk ◽ Association Studies ◽ Genome Wide Association ◽ Data Sets ◽ Genome Wide Association Studies ◽ Coronary Artery Disease Risk ◽ Genome Wide ◽ Artery Disease

Recent studies have led to a broader understanding of the genetic architecture of coronary artery disease and demonstrate that it largely derives from the cumulative effect of multiple common risk alleles individually of small effect size rather than rare variants with large effects on coronary artery disease risk. The tools applied include genome-wide association studies encompassing over 200 000 individuals complemented by bioinformatic approaches including imputation from whole-genome data sets, expression quantitative trait loci analyses, and interrogation of ENCODE (Encyclopedia of DNA Elements), Roadmap Epigenetic Project, and other data sets. Over 160 genome-wide significant loci associated with coronary artery disease risk have been identified using the genome-wide association studies approach, 90% of which are situated in intergenic regions. Here, I will describe, in part, our research over the last decade performed in collaboration with a series of bright trainees and an extensive number of groups and individuals around the world as it applies to our understanding of the genetic basis of this complex disease. These studies include computational approaches to better understand missing heritability and identify causal pathways, experimental approaches, and progress in understanding at the molecular level the function of the multiple risk loci identified and potential applications of these genomic data in clinical medicine and drug discovery.

Heritability jointly Explained by Host Genotype and Microbiome: Will Improve Traits Prediction?

10.1101/2020.04.25.061226 ◽ 2020

Author(s): Denis Awany ◽ Emile R. Chimusa

Keyword(s): Genetic Variants ◽ Association Studies ◽ Heritability Estimate ◽ Substantial Part ◽ Phenotypic Variance ◽ Genome Wide Association Studies ◽ Host Genotype ◽ Genome Wide ◽ Heritability Estimation

Abstract: As we observe the 70th anniversary of the publication by Robertson that formalized the notion of ‘heritability’, geneticists remain puzzled by the problem of missing/hidden heritability, where heritability estimates from genome-wide association studies (GWAS) fall short of those from twin-based studies. Many possible explanations have been offered for this discrepancy, including the existence of genetic variants poorly captured by existing arrays, dominance, epistasis, and unaccounted-for environmental factors, although these remain controversial. We believe a substantial part of this problem could be solved or better understood by incorporating the host’s microbiota information in the GWAS model for heritability estimation, ultimately also improving human trait prediction for clinical utility. This is because, despite empirical observations such as (i) the intimate role of the microbiome in many complex human phenotypes, (ii) the overlap between genetic variants associated with both microbiome attributes and complex diseases, and (iii) the existence of heritable bacterial taxa, current GWAS models for heritability estimation do not take into account the contributory role of the microbiome. Furthermore, heritability estimates from twin-based studies do not discern the microbiome component of the observed total phenotypic variance. Here, we summarize the concept of heritability in GWAS and microbiome-wide association studies (MWAS), focusing on its estimation, from a statistical genetics perspective. We then discuss a possible method to incorporate the microbiome in the estimation of heritability in host GWAS.
