A permutation method for detecting trend correlations in rare variant association studies

Abstract In recent years, there has been an increasing interest in detecting disease-related rare variants in sequencing studies. Numerous studies have shown that common variants can only explain a small proportion of the phenotypic variance for complex diseases. More and more evidence suggests that some of this missing heritability can be explained by rare variants. Considering the importance of rare variants, researchers have proposed a considerable number of methods for identifying the rare variants associated with complex diseases. Extensive research has been carried out on testing the association between rare variants and dichotomous, continuous or ordinal traits. So far, however, there has been little discussion about the case in which both genotypes and phenotypes are ordinal variables. This paper introduces a method based on the γ-statistic, called OV-RV, for examining disease-related rare variants when both genotypes and phenotypes are ordinal. At present, little is known about the asymptotic distribution of the γ-statistic when conducting association analyses for rare variants. One advantage of OV-RV is that it provides a robust estimation of the distribution of the γ-statistic by employing the permutation approach proposed by Fisher. We also perform extensive simulations to investigate the numerical performance of OV-RV under various model settings. The simulation results reveal that OV-RV is valid and efficient; namely, it controls the type I error approximately at the pre-specified significance level and achieves greater power at the same significance level. We also apply OV-RV for rare variant association studies of diastolic blood pressure.
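
The permutation idea in the abstract above can be sketched as follows. This is a minimal illustration, not the OV-RV procedure itself: it computes the generic Goodman-Kruskal γ between two ordinal vectors and estimates a two-sided p-value by shuffling the phenotype. The function names, the genotype coding (for example a per-subject rare-allele count) and the default of 999 permutations are assumptions made for this example.

```python
import random


def gamma_statistic(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D), with tied pairs ignored."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = concordant + discordant
    # With no untied pairs the statistic is undefined; return 0.0 by convention here.
    return 0.0 if total == 0 else (concordant - discordant) / total


def permutation_pvalue(genotype, phenotype, n_perm=999, seed=0):
    """Two-sided permutation p-value for gamma under phenotype shuffling."""
    rng = random.Random(seed)
    observed = gamma_statistic(genotype, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(gamma_statistic(genotype, shuffled)) >= abs(observed):
            hits += 1
    # The +1 terms count the observed arrangement as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: ordinal genotype (0, 1, 2) and ordinal phenotype (0..3).
print(gamma_statistic([0, 0, 1, 2], [0, 1, 2, 3]))  # 1.0
```

Because the null distribution is built from the data by shuffling, no asymptotic distribution of γ is needed, which is the property the abstract highlights.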

Efficient and Flexible Integration of Variant Characteristics in Rare Variant Association Studies Using Integrated Nested Laplace Approximation

10.1101/2020.03.12.988584 ◽ 2020

Author(s): Hana Susak ◽ Laura Serra-Saurina ◽ Raquel Rabionet Janssen ◽ Laura Domènech ◽ Mattia Bosio ◽ ...

Keyword(s): Statistical Methods ◽ Genome Sequencing ◽ Rare Variant ◽ Rare Variants ◽ Association Studies ◽ Complex Diseases ◽ Phenotypic Variance ◽ Rare Variant Association ◽ Genome Wide ◽ Whole Exome

Abstract: Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 70% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests, in the ‘Rare Variant Genome Wide Association Study’ (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering.
Applying rvGWAS to a Chronic Lymphocytic Leukemia study, we identified eight candidate predisposition genes, including EHMT2 and COPS7A.

Data availability and implementation: All relevant data are within the manuscript, and the pipeline implementation is available at https://github.com/hanasusak/rvGWAS

Author summary: Complex diseases arise from genetic factors together with environmental factors, such as air pollution and diet, that jointly determine each individual's susceptibility to a given disease. Much effort has gone into understanding the genetic basis of such diseases, especially into discovering common genetic variants that increase disease risk. However, these variants usually explain only a small part of the etiology of such diseases. Previous studies have shown that rare variants, i.e. variants present in less than 1% of the population, may explain the rest of the variability related to genetic aspects of the disease. Genome sequencing offers the opportunity to discover rare variants, but powerful statistical methods are needed to discriminate those variants that induce susceptibility to the disease. Here we have developed a powerful and flexible statistical approach for detecting rare variants associated with a disease, and we have integrated it into a computer tool that is easy and intuitive for researchers and clinicians to use. We have shown that our approach outperforms other common statistical methods, especially when these variants explain just a small part of the disease. The discovery of these rare variants will contribute to the knowledge of the molecular mechanisms of complex diseases.

Efficient and flexible Integration of variant characteristics in rare variant association studies using integrated nested Laplace approximation

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1007784 ◽ 2021 ◽ Vol 17 (2) ◽ pp. e1007784

Author(s): Hana Susak ◽ Laura Serra-Saurina ◽ German Demidov ◽ Raquel Rabionet ◽ Laura Domènech ◽ ...

Keyword(s): Rare Variant ◽ Rare Variants ◽ Association Studies ◽ Complex Diseases ◽ Laplace Approximation ◽ Phenotypic Variance ◽ Rare Variant Association ◽ Integrated Nested Laplace Approximation ◽ Genome Wide ◽ Whole Exome

Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 70% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests in the ‘Rare Variant Genome Wide Association Study’ (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering. Applying rvGWAS to a Chronic Lymphocytic Leukemia study we identified eight candidate predisposition genes, including EHMT2 and COPS7A.

Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole Genome Sequencing Studies

10.1101/552950 ◽ 2019

Author(s): Zilin Li ◽ Xihao Li ◽ Yaowu Liu ◽ Jincheng Shen ◽ Han Chen ◽ ...

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Rare Variant ◽ Type I Error ◽ Rare Variants ◽ Error Rates ◽ Type I ◽ Whole Genome ◽ Rare Variant Association ◽ Dynamic Scan

Abstract: Whole genome sequencing (WGS) studies are being widely conducted to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses for rare variants have limited power, and variant-set based analyses are commonly used to analyze rare variants. However, existing variant-set based approaches need to pre-specify genetic regions for analysis, and hence are not directly applicable to WGS data due to the large number of intergenic and intronic regions that consist of a massive number of non-coding variants. The commonly used sliding window method requires pre-specifying fixed window sizes, which are often unknown a priori, are difficult to specify in practice, and are subject to limitations given that the sizes of genetic association regions are likely to vary across the genome and across phenotypes. We propose a computationally efficient and dynamic scan statistic method (Scan the Genome (SCANG)) for analyzing WGS data that flexibly detects the sizes and the locations of rare-variant association regions without the need to specify a fixed window size in advance. The proposed method controls the genome-wise type I error rate and accounts for the linkage disequilibrium among genetic variants. It allows the sizes of the detected rare-variant association regions to vary across the genome. Through extensive simulation studies that consider a wide variety of scenarios, we show that SCANG substantially outperforms several alternative rare-variant association detection methods while controlling the genome-wise type I error rate. We illustrate SCANG by analyzing the WGS lipids data from the Atherosclerosis Risk in Communities (ARIC) study.
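
The idea of scanning over window sizes rather than fixing one can be sketched as follows. This is not the SCANG algorithm: as a stand-in, it scores each candidate window by the absolute correlation between a simple burden (sum of minor-allele counts) and the phenotype, and it calibrates the best score against the maximum over all windows under phenotype permutation, which is one generic way to control the error rate across the whole scan. All function names, the scoring rule and the size range are assumptions made for this illustration.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / (sxx * syy) ** 0.5


def best_window(genotypes, phenotype, min_size, max_size):
    """Return (score, (start, end)) of the best window over all allowed sizes."""
    n_variants = len(genotypes[0])
    best_score, best_region = 0.0, None
    for size in range(min_size, max_size + 1):
        for start in range(n_variants - size + 1):
            burden = [sum(row[start:start + size]) for row in genotypes]
            score = abs_correlation(burden, phenotype)
            if score > best_score:
                best_score, best_region = score, (start, start + size)
    return best_score, best_region


def scan_pvalue(genotypes, phenotype, min_size, max_size, n_perm=199, seed=0):
    """Scan-wide p-value: compare the best score with permutation maxima."""
    rng = random.Random(seed)
    observed, region = best_window(genotypes, phenotype, min_size, max_size)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_window(genotypes, shuffled, min_size, max_size)[0] >= observed:
            hits += 1
    return region, observed, (hits + 1) / (n_perm + 1)
```

Taking the maximum over every window and size inside each permutation is what lets the detected region size vary without inflating the error rate of the scan as a whole; the brute-force enumeration here is only suitable for toy data.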

Taking population stratification into account by local permutations in rare-variant association studies on small samples

10.1101/2020.01.29.924977 ◽ 2020 ◽ Cited By ~ 1

Author(s): J. Mullaert ◽ M. Bouaziz ◽ Y. Seeleuthner ◽ B. Bigio ◽ J-L. Casanova ◽ ...

Keyword(s): Sample Size ◽ Rare Variant ◽ Population Stratification ◽ Type I Error ◽ Small Sample Size ◽ Association Studies ◽ Small Sample ◽ Small Samples ◽ Type I ◽ Rare Variant Association

Abstract: Many methods for rare variant association studies require permutations to assess the significance of tests. Standard permutations assume that all individuals are exchangeable and do not take population stratification (PS), a known confounding factor in genetic studies, into account. We propose a novel strategy, LocPerm, in which individuals are permuted only with their closest ancestry-based neighbors. We performed a simulation study, focusing on small samples, to evaluate and compare LocPerm with standard permutations and classical adjustment on first principal components. Under the null hypothesis, LocPerm was the only method providing an acceptable type I error, regardless of sample size and level of stratification. The power of LocPerm was similar to that of standard permutation in the absence of PS, and remained stable in different PS scenarios. We conclude that LocPerm is a method of choice for taking PS and/or small sample size into account in rare variant association studies.
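
A restricted permutation of this kind can be sketched as follows. The abstract does not give LocPerm's exact rules, so the scheme below is an assumed simplification: neighbors are the k closest individuals by Euclidean distance on ancestry principal components, and each permutation swaps case/control labels only between an individual and one of its neighbors, with every individual swapped at most once. The function names, the choice of k and the carrier-rate test statistic are all hypothetical.

```python
import random


def nearest_neighbors(pcs, k):
    """For each individual, indices of the k closest others in PC space."""
    neighbors = []
    for i, pi in enumerate(pcs):
        dist = sorted(
            (sum((a - b) ** 2 for a, b in zip(pi, pj)), j)
            for j, pj in enumerate(pcs)
            if j != i
        )
        neighbors.append([j for _, j in dist[:k]])
    return neighbors


def local_permutation(labels, neighbors, rng):
    """Swap labels only between an individual and one of its neighbors."""
    permuted = list(labels)
    order = list(range(len(labels)))
    rng.shuffle(order)
    used = set()
    for i in order:
        if i in used:
            continue
        candidates = [j for j in neighbors[i] if j not in used]
        if not candidates:
            continue
        j = rng.choice(candidates)
        permuted[i], permuted[j] = permuted[j], permuted[i]
        used.update((i, j))
    return permuted


def carrier_rate_difference(carrier, labels):
    """Carrier frequency in cases (label 1) minus that in controls (label 0)."""
    cases = [c for c, y in zip(carrier, labels) if y == 1]
    controls = [c for c, y in zip(carrier, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def local_perm_pvalue(carrier, labels, neighbors, n_perm=999, seed=0):
    """Two-sided p-value using locally restricted label permutations."""
    rng = random.Random(seed)
    observed = abs(carrier_rate_difference(carrier, labels))
    hits = 0
    for _ in range(n_perm):
        permuted = local_permutation(labels, neighbors, rng)
        if abs(carrier_rate_difference(carrier, permuted)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

When cases and controls come from two well-separated ancestry clusters, labels never leave their cluster under this scheme, so an association that is explained entirely by ancestry is reproduced in every permutation and is not declared significant. A standard permutation would instead mix the clusters and could report a small p-value.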

Controlling for human population stratification in rare variant association studies

Scientific Reports ◽ 10.1038/s41598-021-98370-5 ◽ 2021 ◽ Vol 11 (1)

Author(s): Matthieu Bouaziz ◽ Jimmy Mullaert ◽ Benedetta Bigio ◽ Yoann Seeleuthner ◽ Jean-Laurent Casanova ◽ ...

Keyword(s): Population Stratification ◽ Type I Error ◽ Rare Variants ◽ Association Studies ◽ Genetic Association Studies ◽ Type I ◽ Sample Sizes ◽ Rare Disorders ◽ Type I Errors ◽ Large Numbers

Abstract: Population stratification is a confounder of genetic association studies. In analyses of rare variants, corrections based on principal components (PCs) and linear mixed models (LMMs) yield conflicting conclusions. Studies evaluating these approaches generally focused on limited types of structure and large sample sizes. We investigated the properties of several correction methods through a large simulation study using real exome data, and several within- and between-continent stratification scenarios. We considered different sample sizes, with situations including as few as 50 cases, to account for the analysis of rare disorders. Large samples showed that accounting for stratification was more difficult with a continental than with a worldwide structure. When considering a sample of 50 cases, an inflation of type I errors was observed with PCs for small numbers of controls (≤ 100), and with LMMs for large numbers of controls (≥ 1000). We also tested a novel local permutation method (LocPerm), which maintained a correct type I error in all situations. Power was equivalent for all approaches, indicating that the key issue is to properly control type I errors. Finally, we found that the power of analyses including small numbers of cases can be increased by adding a large panel of external controls, provided an appropriate stratification correction is used.

Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations

Proceedings of the National Academy of Sciences ◽ 10.1073/pnas.1406143112 ◽ 2015 ◽ Vol 112 (4) ◽ pp. 1019-1024 ◽ Cited By ~ 11

Author(s): Yi-Juan Hu ◽ Yun Li ◽ Paul L. Auer ◽ Dan-Yu Lin

Keyword(s): Type I Error ◽ Rare Variants ◽ Extreme Values ◽ Association Studies ◽ Cost Effective ◽ Type I ◽ Genome Wide Association Studies ◽ Score Statistic ◽ Sequencing Data ◽ Association Tests

In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available.
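
To make the phrase "constructed from the score statistic" concrete, the following shows one common textbook form of such gene-level tests, up to normalizing constants. It is an illustration under simplifying assumptions (a quantitative trait $y_i$, no covariates), not necessarily the formulation used in the article; the symbols $g_{ij}$ (genotype of subject $i$ at variant $j$), $w_j$ (variant weight), $n$ and $m$ are introduced here for the example.

```latex
U_j = \sum_{i=1}^{n} (y_i - \bar{y})\, g_{ij}, \qquad
T_{\text{burden}} = \Bigl( \sum_{j=1}^{m} w_j U_j \Bigr)^{2}, \qquad
Q_{\text{SKAT}} = \sum_{j=1}^{m} w_j^{2} U_j^{2}.
```

In the integrative setting described above, $g_{ij}$ would be the observed genotype for sequenced subjects and the imputed genotype for nonsequenced subjects, so the validity of either test depends on estimating the variance of $U_j$ correctly under the sampling scheme.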

An evaluation of approaches for rare variant association analyses of binary traits in related samples

Scientific Reports ◽ 10.1038/s41598-021-82547-z ◽ 2021 ◽ Vol 11 (1)

Author(s): Ming-Huei Chen ◽ Achilleas Pitsillides ◽ Qiong Yang

Keyword(s): Logistic Regression ◽ Rare Variants ◽ Association Studies ◽ Family Relationship ◽ Genetic Association Studies ◽ Error Rates ◽ Ratio Test ◽ Type I ◽ Association Analyses ◽ Binary Traits

Abstract: Recognizing that family data provide a unique advantage in identifying rare risk variants in genetic association studies, many cohorts with related samples have undergone whole genome sequencing in large initiatives such as the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. Analyzing rare variants poses challenges for binary traits in that some genotype categories may have few or no observed events, causing bias and inflation in commonly used methods. Several methods have recently been proposed to better handle rare variants while accounting for family relationship, but their performance has not been thoroughly evaluated together. Here we compare several existing approaches including SAIGE but not limited to related samples using simulations based on the Framingham Heart Study samples and genotype data from the Illumina HumanExome BeadChip, where rare variants are the majority. We found that logistic regression with a likelihood ratio test applied to related samples was the only approach that did not have inflated type I error rates in both the single variant test (SVT) and gene-based tests, followed by Firth logistic regression, which had inflation in its direction-insensitive gene-based test at prevalence 0.01 only, applied to either related or unrelated samples, though theoretically logistic regression and Firth logistic regression do not account for relatedness in samples. SAIGE had inflation in SVT at prevalence 0.1 or lower, and the inflation was eliminated with a minor allele count filter of 5. As for power, no approach consistently outperformed the others across all single variant tests and gene-based tests.

Rare variant enriched identity-by-descent enables the detection of distant relatedness and older divergence between populations

10.1101/2020.05.05.079541 ◽ 2020

Author(s): Amol C. Shetty ◽ Jeffrey O’Connell ◽ Braxton D. Mitchell ◽ Timothy D. O’Connor ◽ ...

Keyword(s): Rare Variant ◽ Human Population ◽ Large Scale ◽ Genetic Relatedness ◽ Rare Variants ◽ Association Studies ◽ Common Variants ◽ Identity By Descent ◽ Association Analyses ◽ Scale Population

Abstract

Motivation: The global human population has experienced an explosive growth from a few million to roughly 7 billion people in the last 10,000 years. Accompanying this growth has been the accumulation of rare variants that can inform our understanding of human evolutionary history. Common variants have primarily been used to infer the structure of the human population and relatedness between two individuals. However, with the increasing abundance of rare variants observed in large-scale projects, such as Trans-Omics for Precision Medicine (TOPMed), the use of rare variants to decipher cryptic relatedness and fine-scale population structure can be beneficial to the study of population demographics and association studies. Identity-by-descent (IBD) is an important framework used for identifying these relationships. IBD segments are broken down by recombination over time, such that longer shared haplotypes give strong evidence of recent relatedness while shorter shared haplotypes are indicative of more distant relationships. Current methods to identify IBD accurately detect only long segments (> 2 cM) found in related individuals.

Algorithm: We describe a metric that leverages rare-variants shared between individuals to improve the detection of short IBD segments. We computed IBD segments using existing methods implemented in Refined IBD where we enrich the signal using our metric that facilitates the detection of short IBD segments (< 2 cM) by explicitly incorporating rare variants.

Results: To test our new metric, we simulated datasets involving populations with varying divergent time-scales. We show that rare-variant IBD identifies shorter segments with greater confidence and enables the detection of older divergence between populations. As an example, we applied our metric to the Old-Order Amish cohort with known genealogies dating 14 generations back to validate its ability to detect genetic relatedness between distant relatives. This analysis shows that our method increases the accuracy of identifying shorter segments that in turn capture distant relationships.

Conclusions: We describe a method to enrich the detection of short IBD segments using rare-variant sharing within IBD segments. Leveraging rare-variant sharing improves the information content of short IBD segments better than common variants alone. We validated the method in both simulated and empirical datasets. This method can benefit association analyses, IBD mapping analyses, and demographic inferences.

Bayesian model comparison for rare variant association studies of multiple phenotypes

10.1101/257162 ◽ 2018 ◽ Cited By ~ 3

Author(s): Christopher DeBoever ◽ Matthew Aguirre ◽ Yosuke Tanigawa ◽ Chris C. A. Spencer ◽ Timothy Poterba ◽ ...

Keyword(s): Genetic Variation ◽ Rare Variant ◽ Genetic Variants ◽ Model Comparison ◽ Rare Variants ◽ Association Studies ◽ Meta Analysis ◽ Rare Variant Association ◽ Physical Measurements ◽ Comparison Approach

Abstract: Whole genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery and inference that are not addressed by the traditional one variant-one phenotype association study. Here we introduce a model comparison approach we refer to as MRP for rare variant association studies that considers correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. We consider the use of summary statistic data to apply univariate and multivariate gene-based meta-analysis models for identifying rare variant associations with an emphasis on protective protein-truncating variants that can expedite drug discovery. Through simulation studies, we demonstrate that the proposed model comparison approach can improve the ability to detect rare variant association signals. We also apply the model to two groups of phenotypes from the UK Biobank: 1) asthma diagnosis, eosinophil counts, forced expiratory volume, and forced vital capacity; and 2) glaucoma diagnosis, intra-ocular pressure, and corneal resistance factor. We are able to recover known associations such as the protective association between rs146597587 in IL33 and asthma. We also find evidence for novel protective associations between rare variants in ANGPTL7 and glaucoma. Overall, we show that the MRP model comparison approach is able to retain and improve upon useful features from widely-used meta-analysis approaches for rare variant association analyses and prioritize protective modifiers of disease risk.

Author summary: Due to the continually decreasing cost of acquiring genetic data, we are now beginning to see large collections of individuals for which we have both genetic information and trait data such as disease status, physical measurements, biomarker levels, and more. These datasets offer new opportunities to find relationships between inherited genetic variation and disease. While it is known that there are relationships between different traits, typical genetic analyses only focus on analyzing one genetic variant and one phenotype at a time. Additionally, it is difficult to identify rare genetic variants that are associated with disease due to their scarcity, even among large sample sizes. In this work, we present a method for identifying associations between genetic variation and disease that considers multiple rare variants and phenotypes at the same time. By sharing information across rare variants and phenotypes, we improve our ability to identify rare variants associated with disease compared to considering a single rare variant and a single phenotype. The method can be used to identify candidate disease genes as well as genes that might represent attractive drug targets.

Selection and explosive growth may hamper the performance of rare variant association tests

10.1101/015917 ◽ 2015 ◽ Cited By ~ 2

Author(s): Lawrence H. Uricchio ◽ John S. Witte ◽ Ryan D. Hernandez

Keyword(s): Natural Selection ◽ Rare Variant ◽ Complex Traits ◽ Statistical Power ◽ Rare Variants ◽ Model Parameters ◽ Phenotypic Variance ◽ Additive Variance ◽ Rare Variant Association ◽ Association Tests

Much recent debate has focused on the role of rare variants in complex phenotypes. However, it is well known that rare alleles can only contribute a substantial proportion of the phenotypic variance when they have much larger effect sizes than common variants, which is most easily explained by natural selection constraining trait-altering alleles to low frequency. It is also plausible that demographic events will influence the genetic architecture of complex traits. Unfortunately, most rare variant association tests do not explicitly model natural selection or non-equilibrium demography. Here, we develop a novel evolutionary model of complex traits. We perform numerical calculations and simulate phenotypes under this model using inferred human demographic and selection parameters. We show that rare variants only contribute substantially to complex traits under very strong assumptions about the relationship between effect size and selection strength. We then assess the performance of state-of-the-art rare variant tests using our simulations across a broad range of model parameters. Counterintuitively, we find that statistical power is lowest when rare variants make the greatest contribution to the additive variance, and that power is substantially lower under our model than previously studied models. While many empirical studies have attempted to identify causal loci using rare variant association methods, few have reported novel associations. Some authors have interpreted this to mean that rare variants contribute little to heritability, but our results show that an alternative explanation is that rare variant tests have less power than previously estimated.
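
The claim that rare alleles need much larger effect sizes to matter follows from the standard expression for the additive variance contributed by a single biallelic variant. The worked example below assumes Hardy-Weinberg proportions and a purely additive per-allele effect; the symbols $p_j$ (minor allele frequency) and $\beta_j$ (per-allele effect), and the frequencies 0.001 and 0.3, are chosen here for illustration and do not come from the article.

```latex
V_j = 2\, p_j (1 - p_j)\, \beta_j^{2}
\quad\Longrightarrow\quad
\frac{\beta_{\text{rare}}}{\beta_{\text{common}}}
  = \sqrt{\frac{0.3 \times 0.7}{0.001 \times 0.999}}
  \approx 14.5
```

Under these assumptions, a variant at frequency 0.001 must have a per-allele effect roughly 14.5 times that of a variant at frequency 0.3 to contribute the same additive variance.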
