Kernel-based genetic association analysis for microbiome phenotypes identifies host genetic drivers of beta-diversity

Understanding human genetic influences on the gut microbiota helps elucidate the mechanisms by which genetics affects health outcomes. We propose a novel approach, the covariate-adjusted kernel RV (KRV) framework, to map genetic variants associated with microbiome beta-diversity, which focuses on overall shifts in the microbiota. The proposed KRV framework improves statistical power by capturing intrinsic structure within the genetic and microbiome data while reducing the multiple-testing burden. We apply the covariate-adjusted KRV test to the Hispanic Community Health Study/Study of Latinos in a genome-wide association analysis (first gene-level, then variant-level) for microbiome beta-diversity. We have identified an immunity-related gene, IL23R, reported in previous association studies and discovered 3 other novel genes, 2 of which are involved in immune functions or autoimmune disorders. Our findings highlight the value of the KRV as a powerful microbiome GWAS approach and support an important role of immunity-related genes in shaping the gut microbiome composition.
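As an illustration of the kind of statistic a kernel RV test is built on, the sketch below computes the standard, unadjusted kernel RV coefficient between two sample-by-sample kernel matrices: the trace of the product of the double-centered kernels, normalized by their Frobenius norms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the linear kernel are hypothetical choices made here, and the covariate adjustment and p-value calculation described in the abstract are not reproduced.

```python
import math


def linear_kernel(X):
    """Illustrative linear kernel K[i][j] = <x_i, x_j> (an assumption, not the paper's kernel)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def double_center(K):
    """Return H K H with H = I - (1/n) 11^T, computed from row, column and grand means."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def trace_product(A, B):
    """Trace of the matrix product A B for square matrices of equal size."""
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))


def krv_coefficient(K, L):
    """Unadjusted kernel RV coefficient between two n x n kernel matrices."""
    Kc, Lc = double_center(K), double_center(L)
    denom = math.sqrt(trace_product(Kc, Kc) * trace_product(Lc, Lc))
    if denom == 0.0:
        raise ValueError("a centered kernel is identically zero")
    return trace_product(Kc, Lc) / denom


# Toy data: 4 samples, a 2-feature "genotype" block and a 1-feature "phenotype" block.
genotypes = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [1.0, 2.0]]
phenotype = [[1.0], [2.0], [0.5], [3.0]]
K = linear_kernel(genotypes)
L = linear_kernel(phenotype)
krv_value = krv_coefficient(K, L)
```

For positive semidefinite kernels the coefficient lies between 0 and 1, and it equals 1 when one centered kernel is a positive multiple of the other.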

2dFDR: a new approach to confounder adjustment substantially increases detection power in omics association studies

Genome Biology ◽

10.1186/s13059-021-02418-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Sangyoon Yi ◽

Xianyang Zhang ◽

Lu Yang ◽

Jinyan Huang ◽

Yuanhang Liu ◽

...

Keyword(s):

Multiple Testing ◽

Statistical Power ◽

Association Studies ◽

Control Procedure ◽

Multiple Testing Correction ◽

New Approach ◽

False Discovery ◽

Traditional Procedure ◽

Extensive Evaluation ◽

Confounder Adjustment

One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.

The harmonic mean p-value for combining dependent tests

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1814092116 ◽

2019 ◽

Vol 116 (4) ◽

pp. 1195-1200 ◽

Cited By ~ 43

Author(s):

Daniel J. Wilson

Keyword(s):

Multiple Testing ◽

Statistical Power ◽

Scientific Discovery ◽

Association Studies ◽

Harmonic Mean ◽

P Value ◽

Genome Wide Association Studies ◽

Familywise Error Rate ◽

Significance Threshold ◽

Genome Wide

Analysis of “big data” frequently involves statistical comparison of millions of competing hypotheses to discover hidden processes underlying observed patterns of data, for example, in the search for genetic determinants of disease in genome-wide association studies (GWAS). Controlling the familywise error rate (FWER) is considered the strongest protection against false positives but makes it difficult to reach the multiple testing-corrected significance threshold. Here, I introduce the harmonic mean p-value (HMP), which controls the FWER while greatly improving statistical power by combining dependent tests using the generalized central limit theorem. I show that the HMP effortlessly combines information to detect statistically significant signals among groups of individually nonsignificant hypotheses in examples of a human GWAS for neuroticism and a joint human–pathogen GWAS for hepatitis C viral load. The HMP simultaneously tests all ways to group hypotheses, allowing the smallest groups of hypotheses that retain significance to be sought. The power of the HMP to detect significant hypothesis groups is greater than the power of the Benjamini–Hochberg procedure to detect significant hypotheses, although the latter only controls the weaker false discovery rate (FDR). The HMP has broad implications for the analysis of large datasets, because it enhances the potential for scientific discovery.
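To make the combining step concrete, the sketch below computes a raw weighted harmonic mean of a group of p-values, sum(w) / sum(w / p). It is a minimal sketch: the function name and the equal default weights are assumptions made here, and the calibration that the abstract attributes to the generalized central limit theorem is not reproduced, so the returned value should not be read as an exact combined p-value.

```python
from typing import Optional, Sequence


def harmonic_mean_p(
    pvals: Sequence[float], weights: Optional[Sequence[float]] = None
) -> float:
    """Raw weighted harmonic mean of p-values: sum(w) / sum(w / p)."""
    if not pvals:
        raise ValueError("at least one p-value is required")
    if weights is None:
        # Equal weights are an illustrative default.
        weights = [1.0 / len(pvals)] * len(pvals)
    if len(weights) != len(pvals):
        raise ValueError("weights and p-values must have the same length")
    if any(not (0.0 < p <= 1.0) for p in pvals):
        raise ValueError("p-values must lie in (0, 1]")
    return sum(weights) / sum(w / p for w, p in zip(weights, pvals))


# Three tests, none individually striking after correction for many tests.
group = [0.01, 0.02, 0.5]
hmp_raw = harmonic_mean_p(group)
```

Because the harmonic mean is dominated by the smallest values, a group containing a few small p-values yields a small combined value even when the remaining p-values are large.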

Retrospective Association Analysis of Longitudinal Binary Traits Identifies Important Loci and Pathways in Cocaine Use

Genetics ◽

10.1534/genetics.119.302598 ◽

2019 ◽

Vol 213 (4) ◽

pp. 1225-1236 ◽

Cited By ~ 1

Author(s):

Weimiao Wu ◽

Zhong Wang ◽

Ke Xu ◽

Xinyu Zhang ◽

Amei Amei ◽

...

Keyword(s):

Association Analysis ◽

Binary Data ◽

Association Studies ◽

Association Test ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Cocaine Use ◽

Genome Wide ◽

A Genome ◽

Time Varying Covariates

Longitudinal phenotypes have been increasingly available in genome-wide association studies (GWAS) and electronic health record-based studies for identification of genetic variants that influence complex traits over time. For longitudinal binary data, there remain significant challenges in gene mapping, including misspecification of the model for phenotype distribution due to ascertainment. Here, we propose L-BRAT (Longitudinal Binary-trait Retrospective Association Test), a retrospective, generalized estimating equation-based method for genetic association analysis of longitudinal binary outcomes. We also develop RGMMAT, a retrospective, generalized linear mixed model-based association test. Both tests are retrospective score approaches in which genotypes are treated as random conditional on phenotype and covariates. They allow both static and time-varying covariates to be included in the analysis. Through simulations, we illustrated that retrospective association tests are robust to ascertainment and other types of phenotype model misspecification, and gain power over previous association methods. We applied L-BRAT and RGMMAT to a genome-wide association analysis of repeated measures of cocaine use in a longitudinal cohort. Pathway analysis implicated association with opioid signaling and axonal guidance signaling pathways. Lastly, we replicated important pathways in an independent cocaine dependence case-control GWAS. Our results illustrate that L-BRAT is able to detect important loci and pathways in a genome scan and to provide insights into genetic architecture of cocaine use.

Power Estimation for Gene-Longevity Association Analysis Using Concordant Twins

Genetics Research International ◽

10.1155/2014/154204 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8

Author(s):

Qihua Tan ◽

Jing Hua Zhao ◽

Torben Kruse ◽

Kaare Christensen

Keyword(s):

Association Study ◽

Genetic Association ◽

Association Analysis ◽

Statistical Power ◽

Association Studies ◽

Genetic Association Studies ◽

Small Sample ◽

Identical Twins ◽

Human Longevity ◽

Sample Sizes

Statistical power is one of the major concerns in genetic association studies. Related individuals such as twins are valuable samples for genetic studies because of their genetic relatedness. Phenotype similarity in twin pairs provides evidence of genetic control over the phenotype variation in a population. The genetic association study of human longevity, a complex trait that is under the control of both genetic and environmental factors, has been hampered by the small sample sizes of long-lived subjects, which limit statistical power. Twin pairs concordant for longevity have an increased probability of carrying beneficial genes and thus are useful samples for gene-longevity association analysis. We conducted a computer simulation to estimate the power of an association study using longevity-concordant twin pairs. We observed remarkable power increases when using singletons from longevity-concordant twin pairs as cases in comparison with sporadic proband cases. Achieving similar power required twice the sample size for fraternal twins as for identical twins concordant for longevity, suggesting that longevity-concordant identical twins are more efficient samples than fraternal twins. We also observed an approximately 2- to 3-fold increase in the sample size needed for a longevity cutoff at age 90 compared with a cutoff at age 95. Overall, our results show the high value of twins in genetic association studies of human longevity.

FIQT: a simple, powerful method to accurately estimate effect sizes in genome scans

10.1101/019299 ◽

2015 ◽

Cited By ~ 1

Author(s):

Tim B Bigdeli ◽

Donghyung Lee ◽

Brien P Riley ◽

Vladimir I Vladimirov ◽

Ayman H Fanous ◽

...

Keyword(s):

Multiple Testing ◽

Empirical Bayes ◽

Association Studies ◽

Genome Wide Association Studies ◽

Genome Scans ◽

P Values ◽

Psychiatric Genetic ◽

Genome Wide ◽

A Genome ◽

Z Scores

Genome scans, including both genome-wide association studies and deep sequencing, continue to discover a growing number of significant association signals for various traits. However, variants meeting genome-wide significance criteria often explain far less of the overall trait variance than “sub-threshold” association signals. To extract these sub-threshold signals, there is a need for methods which accurately estimate the mean of all (normally-distributed) test statistics from a genome scan (i.e., Z-scores). This is currently achieved by the difficult procedure of adjusting all Z-score (χ_1^2) statistics for “winner’s curse” (multiple testing). Given that multiple testing adjustments are much simpler for p-values, we propose a method for estimating Z-score means by i) first adjusting their p-values for multiple testing and then ii) transforming the adjusted p-values to upper tail Z-scores with the sign of the original statistics. Because a False Discovery Rate (FDR) procedure is used for multiple testing adjustment, we denote this method FDR Inverse Quantile Transformation (FIQT). When compared to competitors, e.g. Empirical Bayes (including proposed improvements), FIQT is more i) accurate and ii) computationally efficient by orders of magnitude. Its accuracy advantage is substantial at larger sample sizes and/or moderate numbers of association signals. Practical application of FIQT to Z-scores from the first Psychiatric Genetic Consortium (PGC) schizophrenia study predicts a non-trivial fraction of the significant signal regions from the subsequent published PGC schizophrenia studies. Finally, we suggest that FIQT might be i) used to improve subject level risk prediction and ii) further improved by modelling the noncentrality of χ_1^2 statistics.
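The two steps described in the abstract can be sketched as follows: convert each Z-score to a two-sided p-value, adjust the p-values with an FDR procedure, then map each adjusted p-value back to an upper-tail Z-score carrying the sign of the original statistic. This is a minimal sketch under stated assumptions, not the authors' code: the Benjamini-Hochberg adjustment is assumed as the FDR procedure, and the function names and the `min_p` floor are hypothetical choices made here.

```python
import math
from statistics import NormalDist


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed here as the FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fiqt_sketch(z_scores, min_p=1e-300):
    """Shrink Z-scores via FDR adjustment followed by an inverse quantile transform."""
    std_normal = NormalDist()
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    pvals = [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_scores]
    shrunk = []
    for z, p_adj in zip(z_scores, bh_adjust(pvals)):
        p_adj = max(p_adj, min_p)  # hypothetical floor to keep inv_cdf defined
        magnitude = -std_normal.inv_cdf(p_adj / 2.0)  # upper-tail quantile
        shrunk.append(math.copysign(magnitude, z))
    return shrunk


z_in = [4.0, -3.0, 1.0, 0.2, -0.5]
z_out = fiqt_sketch(z_in)
```

Because an adjusted p-value is never smaller than the raw one, each output Z-score is no larger in magnitude than its input, which is the shrinkage that counters the winner's curse.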

Optimal Genomic Control in Large-scale Genetic Associations for Binary Diseases

10.21203/rs.3.rs-318017/v2 ◽

2021 ◽

Author(s):

Runqing Yang ◽

Yuxin Song ◽

Li Jiang ◽

Zhiyu Hao ◽

Runqing Yang

Keyword(s):

Multiple Testing ◽

Statistical Power ◽

Large Scale ◽

Association Studies ◽

Joint Analysis ◽

Genome Wide Association Studies ◽

Genetic Associations ◽

Genomic Heritability ◽

Large Scale Data ◽

Genome Wide

Complex computation and approximate solutions hinder the application of generalized linear mixed models (GLMM) to genome-wide association studies. We extended GRAMMAR to handle binary diseases by treating genomic breeding values (GBVs) estimated in advance as a known predictor in genomic logit regression, and then controlled polygenic effects by regulating genomic heritability downward. Using simulations and case analyses, we showed that, in optimizing GRAMMAR, polygenic effects and genomic controls could be evaluated using fewer sampled markers, which greatly simplified GLMM-based association analysis in large-scale data. In addition, joint analysis of quantitative trait nucleotide (QTN) candidates chosen by multiple testing offered significantly improved statistical power to detect QTNs over existing methods.

CALDERA: Finding all significant de Bruijn subgraphs for bacterial GWAS

10.1101/2021.11.05.467462 ◽

2021 ◽

Author(s):

Hector Roux de Bezieux ◽

Leandro Lima ◽

Fanny Perraudeau ◽

Arnaud Mary ◽

Sandrine Dudoit ◽

...

Keyword(s):

Statistical Power ◽

Association Studies ◽

Bacterial Species ◽

De Bruijn Graph ◽

Testable Hypothesis ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

A Genome ◽

De Bruijn ◽

Connected Subgraphs

Genome wide association studies (GWAS), aiming to find genetic variants associated with a trait, have widely been used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single nucleotide polymorphisms to mobile genetic elements. Since many bacterial species include genes that are not shared among all strains, this approach avoids the reliance on a common reference genome. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects when trying to detect its association to a phenotype through k-mer based GWAS. Here we propose to overcome this by testing covariates built from closed connected subgraphs of the De Bruijn graph defined over genomic k-mers. These covariates are able to capture polymorphic genes as a single entity, improving k-mer based GWAS in terms of power and interpretability. As the number of subgraphs is exponential in the number of nodes in the DBG, a method naively testing all possible subgraphs would result in very low statistical power due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypothesis has successfully been used to address both problems in similar contexts. We leverage this concept to test all closed connected subgraphs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. We illustrate this on both real and simulated datasets and also demonstrate how considering subgraphs leads to a more powerful and interpretable method. Our method integrates with existing visual tools to facilitate interpretation. 
We also provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_Recomb.

Genome-wide association analysis of type 2 diabetes in the EPIC-InterAct study

Scientific Data ◽

10.1038/s41597-020-00716-7 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Lina Cai ◽

Eleanor Wheeler ◽

Nicola D. Kerrison ◽

Jian’an Luan ◽

Panos Deloukas ◽

...

Keyword(s):

Type 2 Diabetes ◽

Association Analysis ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Global Public Health ◽

Genome Wide Association Analysis ◽

Genome Wide ◽

A Genome

Type 2 diabetes (T2D) is a global public health challenge. Whilst the advent of genome-wide association studies has identified >400 genetic variants associated with T2D, our understanding of its biological mechanisms and translational insights is still limited. The EPIC-InterAct project, centred in 8 countries in the European Prospective Investigations into Cancer and Nutrition study, is one of the largest prospective studies of T2D. Established as a nested case-cohort study to investigate the interplay between genetic and lifestyle behavioural factors on the risk of T2D, a total of 12,403 individuals were identified as incident T2D cases, and a representative sub-cohort of 16,154 individuals was selected from a larger cohort of 340,234 participants with a follow-up time of 3.99 million person-years. We describe the results from a genome-wide association analysis between more than 8.9 million SNPs and T2D risk among 22,326 individuals (9,978 cases and 12,348 non-cases) from the EPIC-InterAct study. The summary statistics to be shared provide a valuable resource to facilitate further investigations into the genetics of T2D.

Meta-Analysis of Family-Based and Case-Control Genetic Association Studies that Use the Same Cases

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1640 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 6

Author(s):

Pantelis G Bagos ◽

Niki L Dimou ◽

Theodore D Liakopoulos ◽

Georgios K Nikolopoulos

Keyword(s):

Statistical Power ◽

Genome Wide Association Study ◽

Association Studies ◽

Meta Analysis ◽

Case Control ◽

Single Step ◽

Individual Data ◽

A Genome ◽

Syncytial Virus ◽

Family Based

In many cases in genetic epidemiology, investigators, in an effort to control for different sources of confounding and simultaneously increase power, perform a family-based and a population-based case-control study within the same population, using the same, or a largely overlapping, set of cases. Various methods have been proposed for performing a combined analysis, but they all require access to individual data that are difficult to gather in a meta-analysis. Here, we propose a simple and efficient summary-based method for performing the meta-analysis. The key point, in contrast to the earlier methods that need individual data, is the calculation of the covariance between the study estimates (log-Odds Ratios), using only data derived from the literature in the form of a 2x2 contingency table. Afterwards, the studies can easily be combined either in a two-step procedure using traditional methods for univariate meta-analysis or in a single-step approach using hierarchical models. In either case, the meta-analysis can be performed using standard software; because of the increased sample size, the statistical power of the meta-analysis is increased, and the procedure allows several diagnostics to be performed (publication bias, cumulative meta-analysis, sensitivity analysis). The method is evaluated on a dataset of 356 single nucleotide polymorphisms (SNPs) which were evaluated for their potential association with Respiratory Syncytial Virus Bronchiolitis (RSV), and subsequently is applied in a meta-analysis concerning the association of the 10-Repeat Allele of a VNTR Polymorphism in the 3’-UTR of Dopamine Transporter Gene with Attention Deficit Hyperactivity Disorder (ADHD), as well as in a genome-wide association study for Multiple Sclerosis. Implementation of the method is straightforward, and a Stata program implementing the methods presented here is given in the Appendix.

Retrospective Association Analysis of Longitudinal Binary Traits Identifies Important Loci and Pathways in Cocaine Use

10.1101/628180 ◽

2019 ◽

Author(s):

Weimiao Wu ◽

Zhong Wang ◽

Ke Xu ◽

Xinyu Zhang ◽

Amei Amei ◽

...

Keyword(s):

Association Analysis ◽

Complex Traits ◽

Binary Data ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Cocaine Use ◽

Genome Wide ◽

A Genome ◽

Time Varying Covariates

Longitudinal phenotypes have been increasingly available in genome-wide association studies (GWAS) and electronic health record-based studies for identification of genetic variants that influence complex traits over time. For longitudinal binary data, there remain significant challenges in gene mapping, including misspecification of the model for the phenotype distribution due to ascertainment. Here, we propose L-BRAT, a retrospective, generalized estimating equations-based method for genetic association analysis of longitudinal binary outcomes. We also develop RGMMAT, a retrospective, generalized linear mixed model-based association test. Both tests are retrospective score approaches in which genotypes are treated as random conditional on phenotype and covariates. They allow both static and time-varying covariates to be included in the analysis. Through simulations, we illustrated that retrospective association tests are robust to ascertainment and other types of phenotype model misspecification, and gain power over previous association methods. We applied L-BRAT and RGMMAT to a genome-wide association analysis of repeated measures of cocaine use in a longitudinal cohort. Pathway analysis implicated association with opioid signaling and axonal guidance signaling pathways. Lastly, we replicated important pathways in an independent cocaine dependence case-control GWAS. Our results illustrate that L-BRAT is able to detect important loci and pathways in a genome scan and to provide insights into genetic architecture of cocaine use.
