Correcting statistical bias in correlation-based kinship estimators

Accurate estimation of relatedness is important for genetic data analyses, such as association mapping and heritability estimation based on data collected from genome-wide association studies. Inaccurate relatedness estimates may lead to spurious associations and biased heritability estimates. Individual-level genotype data are often used to estimate the kinship coefficient between individuals. The commonly used sample correlation-based genomic relationship matrix (scGRM) method estimates the kinship coefficient by calculating the average sample correlation coefficient among all single nucleotide polymorphisms (SNPs), where the observed allele frequencies are used to calculate both the expectations and variances of genotypes. Although this method is widely used, a substantial proportion of estimated kinship coefficients are negative, which makes them difficult to interpret. In this paper, through mathematical derivation, we show that there indeed exists bias in the estimated kinship coefficient using the scGRM method when the observed allele frequencies are regarded as true frequencies. This leads to negative bias for the average estimate of kinship among all individuals, which explains the estimated negative kinship coefficients. Based on this observation, we propose an unbiased estimation method, UKin, which can reduce the bias. We justify our improved method with rigorous mathematical proof. We have conducted simulations as well as two real data analyses to demonstrate that both bias and root mean square error in kinship coefficient estimation can be reduced by using UKin. Further simulations indicate that the power in association mapping can also be improved by using our unbiased kinship estimates to adjust for cryptic relatedness.
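The negative average described above can be reproduced with a small sketch. It is not the UKin method and not the authors' code; it is a plain correlation-style GRM in which the observed allele frequencies stand in for the true ones. The function names, the simulated unrelated individuals, and the 2p(1-p) scaling (relatedness scale, which is twice the kinship coefficient under one common convention) are all assumptions made for illustration. Because the centered genotypes sum to zero at every SNP, all entries of the matrix sum to zero, so the off-diagonal entries must average below zero.

```python
import random


def sc_grm(genotypes):
    """Correlation-style GRM using observed allele frequencies (illustrative).

    genotypes: list of individuals, each a list of 0/1/2 allele counts per SNP.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Observed (sample) allele frequency at each SNP.
    freqs = [sum(ind[k] for ind in genotypes) / (2 * n) for k in range(m)]
    # Monomorphic SNPs have zero variance and are skipped.
    keep = [k for k in range(m) if 0.0 < freqs[k] < 1.0]
    grm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = 0.0
            for k in keep:
                p = freqs[k]
                total += ((genotypes[i][k] - 2 * p) * (genotypes[j][k] - 2 * p)
                          / (2 * p * (1 - p)))
            grm[i][j] = grm[j][i] = total / len(keep)
    return grm


def mean_off_diagonal(grm):
    """Average of all entries with i != j."""
    n = len(grm)
    total = sum(grm[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Simulate unrelated individuals (hypothetical sizes and frequencies).
random.seed(0)
n_individuals, n_snps = 20, 200
true_freqs = [random.uniform(0.1, 0.9) for _ in range(n_snps)]
genos = [[sum(random.random() < p for _ in range(2)) for p in true_freqs]
         for _ in range(n_individuals)]

grm = sc_grm(genos)
print("mean off-diagonal relatedness:", mean_off_diagonal(grm))
```

Even though every simulated pair is unrelated, the mean off-diagonal value is negative by construction, which is the behaviour the abstract attributes to treating observed frequencies as true frequencies.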


RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID

PLoS Genetics ◽

10.1371/journal.pgen.1009315 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1009315

Author(s):

Ardalan Naseri ◽

Junjie Shi ◽

Xihong Lin ◽

Shaojie Zhang ◽

Degui Zhi

Keyword(s):

Large Scale ◽

Association Studies ◽

Scale Up ◽

Data Driven ◽

Genome Wide Association Studies ◽

Inference Method ◽

Genome Wide ◽

Familial Relationship ◽

Kinship Coefficients ◽

Data Driven Approach

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
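How ϕ and π0 follow from IBD segments can be sketched with the textbook relations ϕ = π1/4 + π2/2 and π0 = 1 - π1 - π2, where π1 and π2 are the fractions of the genome shared IBD1 and IBD2. This is not RAFFI's implementation: the data-driven adjustment for phasing and genotyping quality is omitted, and the function name, the segment format (start, end), and the toy genome length are assumptions.

```python
def kinship_from_ibd(ibd1_segments, ibd2_segments, genome_length):
    """Estimate (phi, pi0) for one pair from non-overlapping IBD segments.

    ibd1_segments / ibd2_segments: lists of (start, end) intervals, in the
    same units as genome_length, where the pair shares one or two haplotypes.
    """
    ibd1_total = sum(end - start for start, end in ibd1_segments)
    ibd2_total = sum(end - start for start, end in ibd2_segments)
    pi1 = ibd1_total / genome_length
    pi2 = ibd2_total / genome_length
    pi0 = 1.0 - pi1 - pi2
    # A random allele pair is IBD with probability 1/4 in IBD1 regions
    # and 1/2 in IBD2 regions.
    phi = pi1 / 4 + pi2 / 2
    return phi, pi0


# Toy genome of length 100 (hypothetical units).
parent_child = kinship_from_ibd([(0, 100)], [], 100)
full_sibs = kinship_from_ibd([(0, 50)], [(50, 75)], 100)
print(parent_child, full_sibs)
```

Both toy pairs have ϕ = 0.25, but π0 is 0 for the parent-offspring pattern and 0.25 for the full-sibling pattern, which illustrates why the two statistics are reported together.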


Heritability jointly explained by host genotype and microbiome: will improve traits prediction?

Briefings in Bioinformatics ◽

10.1093/bib/bbaa175 ◽

2020 ◽

Author(s):

Denis Awany ◽

Emile R Chimusa

Keyword(s):

Genetic Variants ◽

Association Studies ◽

Heritability Estimate ◽

Substantial Part ◽

Phenotypic Variance ◽

Genome Wide Association Studies ◽

Host Genotype ◽

Genome Wide ◽

Heritability Estimation

As we observe the 70th anniversary of the publication by Robertson that formalized the notion of ‘heritability’, geneticists remain puzzled by the problem of missing/hidden heritability, where heritability estimates from genome-wide association studies (GWASs) fall short of those from twin-based studies. Many possible explanations have been offered for this discrepancy, including the existence of genetic variants poorly captured by existing arrays, dominance, epistasis and unaccounted-for environmental factors, although these remain controversial. We believe that incorporating the host’s microbiota information in the GWAS model for heritability estimation could solve, or help explain, a substantial part of this problem, and may also improve prediction of human traits for clinical use. This is because, despite empirical observations such as (i) the intimate role of the microbiome in many complex human phenotypes, (ii) the overlap between genetic variants associated with both microbiome attributes and complex diseases and (iii) the existence of heritable bacterial taxa, current GWAS models for heritability estimation do not take into account the contributory role of the microbiome. Furthermore, heritability estimates from twin-based studies do not separate out the microbiome component of the observed total phenotypic variance. Here, we summarize the concept of heritability in GWAS and microbiome-wide association studies, focusing on its estimation, from a statistical genetics perspective. We then discuss a possible statistical method to incorporate the microbiome in the estimation of heritability in host GWAS.


P708 Identification of 26 novel loci that confer susceptibility to early-onset coronary artery disease in a Japanese population

European Heart Journal ◽

10.1093/eurheartj/ehz747.0313 ◽

2019 ◽

Vol 40 (Supplement_1) ◽

Author(s):

M Oguri ◽

K Kato ◽

H Horibe ◽

T Fujimaki ◽

J Sakuma ◽

...

Keyword(s):

Coronary Artery Disease ◽

Genetic Variants ◽

Early Onset ◽

Association Studies ◽

Allele Frequencies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Exact Test ◽

Genome Wide ◽

Artery Disease

Background: Early-onset coronary artery disease (CAD) has a strong genetic component. Although genome-wide association studies have identified various genes and loci significantly associated with CAD mainly in European ancestry populations, genetic variants that contribute to susceptibility to this condition in Japanese individuals remain to be identified definitively. Purpose: The purpose of the study was to identify genetic variants that confer susceptibility to early-onset CAD in Japanese. We have now performed exome-wide association studies (EWASs) in subjects with early-onset CAD and controls. Methods: A total of 7256 individuals aged ≤65 years was enrolled in the study. The EWAS was conducted with 1482 subjects with CAD and 5774 controls. Genotyping of single nucleotide polymorphisms (SNPs) was performed with Illumina Human Exome-12 DNA Analysis BeadChip or Infinium Exome-24 BeadChip arrays. The relation of allele frequencies for 31,465 SNPs that passed quality control to CAD was examined with Fisher's exact test. To compensate for multiple comparisons of allele frequencies with CAD, we applied a false discovery rate (FDR) of <0.05 for statistical significance of association. Results: The relation of allele frequencies for 31,465 SNPs to CAD with the use of Fisher's exact test showed that 170 SNPs were significantly (FDR <0.05) associated with CAD. Multivariable logistic regression analysis with adjustment for age, sex, and the prevalence of hypertension, diabetes mellitus, and dyslipidemia revealed that 162 SNPs were significantly (P<0.05) related to CAD. A stepwise forward selection procedure was performed to examine the effects of genotypes for the 162 SNPs on CAD. The 54 SNPs were significant (P<0.05) and independent [coefficient of determination (R²), 0.0008 to 0.0297] determinants of CAD. These SNPs together accounted for 15.5% of the cause of CAD. After examination of results from previous genome-wide association studies and linkage disequilibrium of the identified SNPs, we newly identified 21 genes (RNF2, YEATS2, USP45, ITGB8, TNS3, FAM170B-AS1, PRKG1, BTRC, MKI67, STIM1, OR52E4, KIAA1551, MON2, PLUT, LINC00354, TRPM1, ADAT1, KRT27, LIPE, GFY, EIF3L) and five chromosomal regions (2p13, 4q31.2, 5q12, 13q34, 20q13.2) that were significantly associated with CAD. Gene ontology analysis showed that various biological functions were predicted in the 18 genes identified in the present study. The network analysis revealed that the 18 genes had potential direct or indirect interactions with the 30 genes previously shown to be associated with CAD or with the 228 genes identified in previous genome-wide association studies of CAD. Conclusion: We have newly identified 26 loci that confer susceptibility to CAD. Determination of genotypes for the SNPs at these loci may prove informative for assessment of the genetic risk for CAD in Japanese.
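The screening step in the Methods (Fisher's exact test on allele counts, followed by an FDR threshold) can be sketched as below. The abstract does not say which FDR procedure was used; Benjamini-Hochberg is assumed here purely for illustration. The function names and the toy counts and P values are hypothetical and are not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be cases/controls and columns the two alleles of one SNP.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the test passes BH at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Toy example with made-up numbers.
p_toy = fisher_exact_two_sided(3, 0, 0, 3)
flags = benjamini_hochberg([0.001, 0.02, 0.03, 0.5], alpha=0.05)
print(p_toy, flags)
```

In a real analysis the test would be run once per SNP and the resulting P values passed together to the FDR step, since the threshold for each SNP depends on its rank among all tests.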


Using the unified relationship matrix adjusted by breed-wise allele frequencies in genomic evaluation of a multibreed population

Journal of Dairy Science ◽

10.3168/jds.2013-7167 ◽

2014 ◽

Vol 97 (2) ◽

pp. 1117-1127 ◽

Cited By ~ 17

Author(s):

M.L. Makgahlela ◽

I. Strandén ◽

U.S. Nielsen ◽

M.J. Sillanpää ◽

E.A. Mäntysaari

Keyword(s):

Allele Frequencies ◽

Relationship Matrix ◽

Genomic Evaluation


Candidate gene-based association genetics analysis of herbage quality traits in perennial ryegrass (Lolium perenne L.)

Crop and Pasture Science ◽

10.1071/cp12392 ◽

2013 ◽

Vol 64 (3) ◽

pp. 244 ◽

Cited By ~ 8

Author(s):

L. W. Pembleton ◽

J. Wang ◽

N. O. I. Cogan ◽

J. E. Pryce ◽

G. Ye ◽

...

Keyword(s):

Association Mapping ◽

Candidate Gene ◽

Perennial Ryegrass ◽

Association Studies ◽

Phenotypic Variance ◽

Genome Wide Association Studies ◽

Quality Traits ◽

Near Infrared Reflectance ◽

Genome Wide ◽

Herbage Quality

Due to the complex genetic architecture of perennial ryegrass, based on an obligate outbreeding reproductive habit, association-mapping approaches to genetic dissection offer the potential for effective identification of genetic marker–trait linkages. Associations with genes for agronomic characters, such as components of herbage nutritive quality, may then be utilised for accelerated cultivar improvement using advanced molecular breeding practices. The objective of the present study was to evaluate the presence of such associations for a broad range of candidate genes involved in pathways of cell wall biosynthesis and carbohydrate metabolism. An association-mapping panel composed from a broad range of non-domesticated and varietal sources was assembled and assessed for genome-wide sequence polymorphism. Removal of significant population structure obtained a diverse meta-population (220 genotypes) suitable for association studies. The meta-population was established with replication as a spaced-plant field trial. All plants were genotyped with a cohort of candidate gene-derived single nucleotide polymorphism (SNP) markers. Herbage samples were harvested at both vegetative and reproductive stages and were measured for a range of herbage quality traits using near infrared reflectance spectroscopy. Significant associations were identified for ~50% of the genes, accounting for small but significant components of phenotypic variance. The identities of genes with associated SNPs were largely consistent with detailed knowledge of ryegrass biology, and they are interpreted in terms of known biochemical and physiological processes. Magnitudes of effect of observed marker–trait gene association were small, indicating that future activities should focus on genome-wide association studies in order to identify the majority of causal mutations for complex traits such as forage quality.


Genome-wide association and genomic selection in animal breeding

This article is one of a selection of papers from the conference “Exploiting Genome-wide Association in Oilseed Brassicas: a model for genetic improvement of major OECD crops for sustainable farming”.

Genome ◽

10.1139/g10-076 ◽

2010 ◽

Vol 53 (11) ◽

pp. 876-883 ◽

Cited By ~ 135

Author(s):

Ben Hayes ◽

Mike Goddard

Keyword(s):

Genomic Selection ◽

Complex Traits ◽

Association Studies ◽

Genome Wide Association ◽

Relationship Matrix ◽

Genome Wide Association Studies ◽

Simple Method ◽

Breeding Values ◽

Genome Wide ◽

A Genome

Results from genome-wide association studies in livestock and humans have led to the conclusion that the effects of individual quantitative trait loci (QTL) on complex traits, such as yield, are likely to be small; therefore, a large number of QTL are necessary to explain genetic variation in these traits. Given this genetic architecture, gains from marker-assisted selection (MAS) programs using only a small number of DNA markers to trace a limited number of QTL are likely to be small. This has led to the development of an alternative technology for using the available dense single nucleotide polymorphism (SNP) information, called genomic selection. Genomic selection uses a genome-wide panel of dense markers so that all QTL are likely to be in linkage disequilibrium with at least one SNP. The genomic breeding values are predicted to be the sum of the effects of these SNPs across the entire genome. In dairy cattle breeding, the accuracy of genomic estimated breeding values (GEBV) that can be achieved and the fact that these are available early in life have led to rapid adoption of the technology. Here, we discuss the design of experiments necessary to achieve accurate prediction of GEBV in future generations in terms of the number of markers necessary and the size of the reference population where marker effects are estimated. We also present a simple method for implementing genomic selection using a genomic relationship matrix. Future challenges discussed include using whole genome sequence data to improve the accuracy of genomic selection and management of inbreeding through genomic relationships.
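The statement that a genomic breeding value is the sum of SNP effects across the genome can be written out as a few lines of code. This is a minimal sketch under stated assumptions: the SNP effects are taken as already estimated in a reference population, and the function name, genotype coding (0/1/2 allele counts), and all numbers are hypothetical.

```python
def genomic_breeding_values(genotypes, snp_effects):
    """GEBV for each animal as the sum over SNPs of allele count times effect.

    genotypes: list of animals, each a list of 0/1/2 allele counts per SNP.
    snp_effects: per-SNP additive effects estimated elsewhere (assumed given).
    """
    return [sum(count * effect for count, effect in zip(animal, snp_effects))
            for animal in genotypes]


# Two animals, three SNPs (made-up values).
toy_genotypes = [[0, 1, 2], [2, 2, 0]]
toy_effects = [0.5, -1.0, 0.25]
print(genomic_breeding_values(toy_genotypes, toy_effects))
```

Because the prediction needs only genotypes, it can be computed for young animals before any phenotype is recorded, which matches the early-in-life availability the abstract mentions.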


Multiplex Confounding Factor Correction for Genomic Association Mapping with Squared Sparse Linear Mixed Model

10.1101/228114 ◽

2017 ◽

Author(s):

Haohan Wang ◽

Xiang Liu ◽

Yunpeng Xiao ◽

Ming Xu ◽

Eric P. Xing

Keyword(s):

Population Structure ◽

Association Mapping ◽

Complex Traits ◽

Association Studies ◽

Phenotypic Variability ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Confounding Factors ◽

Genetic Loci ◽

Genome Wide

Genome-wide association studies have offered a promising way to understand the association between human genomes and complex traits. Many simple polymorphic loci have been shown to explain a significant fraction of phenotypic variability. However, it remains challenging to explain complex traits associated with multifactorial genetic loci, especially given the confounding factors caused by population structure, family structure, and cryptic relatedness. In this paper, we propose a Squared-LMM (LMM²) model that aims to jointly correct population and genetic confounding factors. We offer two strategies for using LMM² in association mapping: 1) It serves as an extension of the univariate LMM, which can effectively correct for population structure but considers each SNP in isolation. 2) It is integrated with the multivariate regression model to discover associations between complex traits and multifactorial genetic loci. We refer to this second model as sparse Squared-LMM (sLMM²). Further, we extend LMM²/sLMM² by raising the power of our squared model to obtain the LMMⁿ/sLMMⁿ model. We demonstrate the practical use of our model with synthetic phenotypic variants generated from genetic loci of Arabidopsis thaliana. The experiment shows that our method predicts the associations between traits and loci more accurately and with greater significance. We also evaluate our models on collected phenotypes and genotypes by the number of candidate genes that the models discover. The results suggest that our method is promising for use in genome-wide association studies.


A scalable estimator of SNP heritability for Biobank-scale data

10.1101/294470 ◽

2018 ◽

Author(s):

Yue Wu ◽

Sriram Sankararaman

Keyword(s):

Variance Components ◽

Association Studies ◽

Randomized Algorithm ◽

Genome Wide Association Studies ◽

Complex Phenotypes ◽

Genome Wide ◽

Heritability Estimation ◽

Estimate Heritability ◽

Matrix Vector ◽

Scale Data

Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear Mixed Models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e., the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a MoM estimator that has a runtime complexity for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to . We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500,000 individuals and 100,000 SNPs in 38 minutes. Availability: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/[email protected]
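The role of random matrix-vector multiplications can be illustrated with a generic randomized trace estimator (Hutchinson's method). This sketch is not the RHE-reg code, and whether RHE-reg uses exactly this construction is an assumption. It estimates tr(K²) for K = XXᵀ/M, a quantity that method-of-moments variance-component estimators typically need, without ever forming the N×N matrix K. The function names, the Rademacher random vectors, and the toy matrices are illustrative choices; X is assumed to hold standardized genotypes.

```python
import random


def matvec(X, v):
    """Compute X v for an N x M matrix X (list of rows) and length-M vector v."""
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]


def rmatvec(X, u):
    """Compute X^T u for an N x M matrix X and length-N vector u."""
    out = [0.0] * len(X[0])
    for row, ui in zip(X, u):
        for k, x in enumerate(row):
            out[k] += x * ui
    return out


def estimate_trace_k_squared(X, num_vectors, rng):
    """Estimate tr(K^2), K = X X^T / M, from num_vectors random probes.

    For a random sign vector z, E[z^T K^2 z] = tr(K^2), and because K is
    symmetric z^T K^2 z equals ||K z||^2. K z is obtained as X (X^T z) / M,
    so only products with X and X^T are needed.
    """
    n, m = len(X), len(X[0])
    total = 0.0
    for _ in range(num_vectors):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        kz = [value / m for value in matvec(X, rmatvec(X, z))]
        total += sum(value * value for value in kz)
    return total / num_vectors


rng = random.Random(0)
# Toy check: X = 2 * identity with M = 4 gives K = identity, so tr(K^2) = 4.
X_toy = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(estimate_trace_k_squared(X_toy, num_vectors=5, rng=rng))
```

In this sketch each probe costs two passes over X, so the work grows with the number of probes B times the size of the genotype matrix; more probes reduce the variance of the estimate.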


Case-control association mapping without cases

10.1101/045831 ◽

2016 ◽

Cited By ~ 2

Author(s):

Jimmy Z Liu ◽

Yaniv Erlich ◽

Joseph K Pickrell

Keyword(s):

Association Mapping ◽

Complex Traits ◽

Disease Risk ◽

Association Studies ◽

Meta Analysis ◽

Large Population ◽

Case Control ◽

Genome Wide Association Studies ◽

The UK ◽

Control Association

The case-control association study is a powerful method for identifying genetic variants that influence disease risk. However, the collection of cases can be time-consuming and expensive; if a disease occurs late in life or is rapidly lethal, it may be more practical to identify family members of cases. Here, we show that replacing cases with their first-degree relatives enables genome-wide association studies by proxy (GWAX). In randomly-ascertained cohorts, this approach enables previously infeasible studies of diseases that are absent (or nearly absent) in the cohort. As an illustration, we performed GWAX of 12 common diseases in 116,196 individuals from the UK Biobank. By combining these results with published GWAS summary statistics in a meta-analysis, we replicated established risk loci and identified 17 newly associated risk loci: four in Alzheimer’s disease, eight in coronary artery disease, and five in type 2 diabetes. In addition to informing disease biology, our results demonstrate the utility of association mapping using family history of disease as a phenotype to be mapped. We anticipate that this approach will prove useful in future genetic studies of complex traits in large population cohorts.


Estimating SNP heritability in presence of population substructure in biobank-scale datasets

10.1101/2020.08.05.236901 ◽

2020 ◽

Author(s):

Zhaotong Lin ◽

Souvik Seal ◽

Saonli Basu

Keyword(s):

Complex Traits ◽

Population Stratification ◽

Mixed Model ◽

Linear Mixed Model ◽

Population Substructure ◽

Relationship Matrix ◽

Phenotypic Variance ◽

Genetic Contribution ◽

Heritability Estimation ◽

The Impact

SNP heritability of a trait is measured by the proportion of total variance explained by the additive effects of genome-wide single nucleotide polymorphisms (SNPs). Linear mixed models are routinely used to estimate SNP heritability for many complex traits. The basic concept behind this approach is to model the genetic contribution as a random effect, where the variance of this genetic contribution determines the heritability of the trait. This linear mixed model approach requires estimation of ‘relatedness’ among individuals in the sample, which is usually captured by estimating a genetic relationship matrix (GRM). Heritability is estimated by the restricted maximum likelihood (REML) or method of moments (MOM) approaches, and this estimation relies heavily on the GRM computed from the genetic data on individuals. The presence of population substructure in the data could significantly affect the GRM estimation and may introduce bias in heritability estimation. The common practice for accounting for such population substructure is to include the top few principal components of the GRM as covariates in the linear mixed model. Here we propose an alternative way of estimating heritability in multi-ethnic studies. Our proposed approach is a MOM estimator derived from the Haseman-Elston regression and gives an asymptotically unbiased estimate of heritability in the presence of population stratification. It introduces adjustments for the population stratification in a second-order estimating equation and allows the total phenotypic variance to vary by ethnicity. We study the performance of different MOM and REML approaches in the presence of population stratification through extensive simulation studies. We estimate the heritability of height, weight and other anthropometric traits in the UK Biobank cohort to investigate the impact of subtle population substructure on SNP heritability estimation.
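The basic Haseman-Elston idea that the proposed estimator builds on can be sketched as a regression of phenotype cross-products on GRM entries. This is the plain, unadjusted version only: it contains none of the population-stratification adjustments or ethnicity-specific variances described above. The function name and toy inputs are hypothetical, phenotypes are assumed to be standardized beforehand, and the regression is taken through the origin as a simplifying assumption.

```python
def he_regression_slope(grm, phenotypes):
    """Haseman-Elston cross-product regression through the origin.

    Regresses y_i * y_j on K_ij over all pairs i < j. With phenotypes
    standardized to mean 0 and variance 1, the slope is a method-of-moments
    estimate of SNP heritability.
    """
    n = len(phenotypes)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            numerator += grm[i][j] * phenotypes[i] * phenotypes[j]
            denominator += grm[i][j] ** 2
    return numerator / denominator


# Toy inputs chosen so that y_i * y_j = 0.5 * K_ij for every pair.
y_toy = [1.0, -1.0, 1.0]
K_toy = [[1.0, -2.0, 2.0],
         [-2.0, 1.0, -2.0],
         [2.0, -2.0, 1.0]]
print(he_regression_slope(K_toy, y_toy))
```

Only off-diagonal pairs enter the regression, so any bias in the GRM's off-diagonal entries, for example from population substructure, feeds directly into the slope.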
