Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score

Rare variant association tests (RVAT) have been developed to study the contribution of rare variants, which are now widely accessible through high-throughput sequencing technologies. RVAT require aggregating rare variants into testing units and filtering variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding-window procedures have been proposed to circumvent this difficulty; however, they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: “RAVA-FIRST” (RAre Variant Association using Functionally-InfoRmed STeps), comprising three steps. (1) New testing units, referred to as “CADD regions”, are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the gnomAD populations. (2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. On both simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applied to a WGS dataset of venous thromboembolism patients, it identified an intergenic region on chromosome 18 that is enriched for rare variants in early-onset patients and that was missed by standard sliding-window procedures. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages.
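Steps (2) and (3) above can be illustrated with a minimal Python sketch. It is not the Ravages implementation: the data layout, the function name `burden_subscores`, the field names (`region`, `category`, `maf`, `score`) and the thresholds are all hypothetical, chosen only to show what a region-dependent filter followed by per-category burden sub-scores might look like.

```python
from collections import defaultdict


def burden_subscores(genotypes, variants, max_maf=0.01, min_score=None):
    """Count qualifying rare alleles per individual and per (region, category).

    genotypes: {individual: {variant_id: minor-allele count (0, 1 or 2)}}
    variants:  {variant_id: {"region", "category", "maf", "score"}}
    min_score: {region: score threshold}, the region-dependent filter
               (hypothetical layout, not the Ravages data model)
    """
    min_score = min_score or {}
    scores = {}
    for individual, calls in genotypes.items():
        per_unit = defaultdict(int)
        for variant_id, allele_count in calls.items():
            info = variants[variant_id]
            if info["maf"] > max_maf:
                continue  # not rare enough
            if info["score"] < min_score.get(info["region"], 0.0):
                continue  # fails the threshold of its own region
            per_unit[(info["region"], info["category"])] += allele_count
        scores[individual] = dict(per_unit)
    return scores


variants = {
    "v1": {"region": "R1", "category": "coding", "maf": 0.001, "score": 25.0},
    "v2": {"region": "R1", "category": "regulatory", "maf": 0.002, "score": 5.0},
    "v3": {"region": "R1", "category": "coding", "maf": 0.2, "score": 30.0},
}
genotypes = {"ind_a": {"v1": 1, "v2": 1, "v3": 2}}
subscores = burden_subscores(genotypes, variants, min_score={"R1": 10.0})
print(subscores)
```

In this toy input, `v2` is dropped by the region threshold and `v3` by the frequency cut-off, so only `v1` contributes to the coding sub-score of region `R1`. The resulting sub-scores would then enter a regression model as separate covariates; that modelling step is left out here.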

RAREsim: A simulation method for very rare genetic variants

10.1101/2021.04.13.439644 ◽

2021 ◽

Author(s):

Megan Null ◽

Josée Dupuis ◽

Christopher R. Gignoux ◽

Audrey E. Hendricks

Keyword(s):

Rare Variant ◽

Complex Traits ◽

Rare Variants ◽

Simulated Data ◽

Real Data ◽

Simulation Method ◽

Sequencing Data ◽

Variant Annotation ◽

Causal Variants ◽

Rare Genetic Variants

Abstract Identification of rare variant associations is crucial to fully characterize the genetic architecture of complex traits and diseases. Essential in this process is the evaluation of novel methods in simulated data that mirror the distribution of rare variants and haplotype structure in real data. Additionally, importing real variant annotation enables in silico comparison of methods that focus on putative causal variants, such as rare variant association tests and polygenic scoring methods. Existing simulation methods are either unable to employ real variant annotation or severely under- or over-estimate the number of singletons and doubletons, reducing the ability to generalize simulation results to real studies. We present RAREsim, a flexible and accurate rare variant simulation algorithm. Using parameters and haplotypes derived from real sequencing data, RAREsim efficiently simulates the expected variant distribution and enables real variant annotations. We highlight RAREsim’s utility across various genetic regions, sample sizes, ancestries, and variant classes.

Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole Genome Sequencing Studies

10.1101/552950 ◽

2019 ◽

Author(s):

Zilin Li ◽

Xihao Li ◽

Yaowu Liu ◽

Jincheng Shen ◽

Han Chen ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variant ◽

Type I Error ◽

Rare Variants ◽

Error Rates ◽

Type I ◽

Whole Genome ◽

Rare Variant Association ◽

Dynamic Scan

Abstract Whole genome sequencing (WGS) studies are being widely conducted to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses for rare variants have limited power, and variant-set based analyses are commonly used to analyze rare variants. However, existing variant-set based approaches need to pre-specify genetic regions for analysis, and hence are not directly applicable to WGS data due to the large number of intergenic and intronic regions that contain a massive number of non-coding variants. The commonly used sliding window method requires pre-specifying fixed window sizes, which are often unknown a priori, are difficult to specify in practice, and are subject to limitations given that the sizes of genetic association regions are likely to vary across the genome and across phenotypes. We propose a computationally efficient and dynamic scan statistic method (Scan the Genome (SCANG)) for analyzing WGS data that flexibly detects the sizes and the locations of rare-variant association regions without the need to specify a fixed window size in advance. The proposed method controls the genome-wise type I error rate and accounts for the linkage disequilibrium among genetic variants. It allows the sizes of the detected rare-variant association regions to vary across the genome. Through extensive simulation studies that consider a wide variety of scenarios, we show that SCANG substantially outperforms several alternative rare-variant association detection methods while controlling the genome-wise type I error rate. We illustrate SCANG by analyzing the WGS lipids data from the Atherosclerosis Risk in Communities (ARIC) study.
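To make the limitation of fixed windows concrete, the following Python sketch builds the sliding windows that the abstract contrasts with SCANG. It illustrates only the baseline procedure, not SCANG itself; the function name `sliding_windows` and the example positions are hypothetical.

```python
def sliding_windows(positions, size, step):
    """Yield (start, end, members) for fixed-size windows over variant positions.

    Windows are half-open intervals [start, end). Both `size` and `step`
    must be chosen in advance, which is the limitation discussed above.
    Windows containing no variant are skipped.
    """
    if not positions:
        return
    ordered = sorted(positions)
    start, last = ordered[0], ordered[-1]
    while start <= last:
        end = start + size
        members = [p for p in ordered if start <= p < end]
        if members:
            yield start, end, members
        start += step


# Four variants, 1 kb windows moved in 500 bp steps.
windows = list(sliding_windows([100, 150, 900, 2500], size=1000, step=500))
for start, end, members in windows:
    print(start, end, members)
```

Each yielded window would be one testing unit, so the number of tests and the variants grouped together both depend on the pre-specified `size` and `step`. A dynamic scan instead lets the window size vary with the data.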

Efficient and Flexible Integration of Variant Characteristics in Rare Variant Association Studies Using Integrated Nested Laplace Approximation

10.1101/2020.03.12.988584 ◽

2020 ◽

Author(s):

Hana Susak ◽

Laura Serra-Saurina ◽

Raquel Rabionet Janssen ◽

Laura Domènech ◽

Mattia Bosio ◽

...

Keyword(s):

Statistical Methods ◽

Genome Sequencing ◽

Rare Variant ◽

Rare Variants ◽

Association Studies ◽

Complex Diseases ◽

Phenotypic Variance ◽

Rare Variant Association ◽

Genome Wide ◽

Whole Exome

Abstract Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 75% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests, in the ‘Rare Variant Genome Wide Association Study’ (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering. Applying rvGWAS to a Chronic Lymphocytic Leukemia study, we identified eight candidate predisposition genes, including EHMT2 and COPS7A.

Data availability and implementation: All relevant data are within the manuscript, and the pipeline implementation is available at https://github.com/hanasusak/rvGWAS

Author summary: Complex diseases are characterized by being related to genetic factors and environmental factors such as air pollution, diet etc. that together define the susceptibility of each individual to develop a given disease. Much effort has been applied to advance the knowledge of the genetic bases of such diseases, especially in the discovery of frequent genetic variants in the population that increase disease risk. However, these variants usually explain only a small part of the etiology of such diseases. Previous studies have shown that rare variants, i.e. variants present in less than 1% of the population, may explain the rest of the variability related to genetic aspects of the disease. Genome sequencing offers the opportunity to discover rare variants, but powerful statistical methods are needed to discriminate those variants that induce susceptibility to the disease. Here we have developed a powerful and flexible statistical approach for the detection of rare variants associated with a disease and we have integrated it into a computer tool that is easy and intuitive for researchers and clinicians to use. We have shown that our approach outperformed other common statistical methods, especially in situations where these variants explain just a small part of the disease. The discovery of these rare variants will contribute to the knowledge of the molecular mechanisms of complex diseases.

Asymmetric Inheritance of Cell Fate Determinants: Focus on RNA

Non-Coding RNA ◽

10.3390/ncrna5020038 ◽

2019 ◽

Vol 5 (2) ◽

pp. 38 ◽

Cited By ~ 7

Author(s):

Yelyzaveta Shlyakhtina ◽

Katherine L. Moran ◽

Maximiliano M. Portal

Keyword(s):

Cell Fate ◽

Mammalian Cells ◽

High Throughput Sequencing ◽

General Pattern ◽

Nuclear Architecture ◽

Biological Information ◽

Current Evidence ◽

Rna Molecules ◽

Sequencing Technologies ◽

Wide Range

During the last decade, and mainly primed by major developments in high-throughput sequencing technologies, the catalogue of RNA molecules harbouring regulatory functions has increased at a steady pace. Current evidence indicates that hundreds of mammalian RNAs have regulatory roles at several levels, including transcription, translation/post-translation, chromatin structure, and nuclear architecture, thus suggesting that RNA molecules are indeed mighty controllers in the flow of biological information. Therefore, it is logical to suggest that there must exist a series of molecular systems that safeguard the faithful inheritance of RNA content throughout cell division and that those mechanisms must be tightly controlled to ensure the successful segregation of key molecules to the progeny. Interestingly, whilst a handful of integral components of mammalian cells seem to follow a general pattern of asymmetric inheritance throughout division, the fate of RNA molecules largely remains a mystery. Herein, we will discuss current concepts of asymmetric inheritance in a wide range of systems, including prions, proteins, and finally RNA molecules, to assess overall the biological impact of RNA inheritance in cellular plasticity and evolutionary fitness.

A permutation method for detecting trend correlations in rare variant association studies

Genetics Research ◽

10.1017/s0016672319000120 ◽

2019 ◽

Vol 101 ◽

Author(s):

Lifeng Liu ◽

Pengfei Wang ◽

Jingbo Meng ◽

Lili Chen ◽

Wensheng Zhu ◽

...

Keyword(s):

Rare Variant ◽

Type I Error ◽

Rare Variants ◽

Association Studies ◽

Complex Diseases ◽

Type I ◽

Phenotypic Variance ◽

Rare Variant Association ◽

Significance Level ◽

Association Analyses

Abstract In recent years, there has been an increasing interest in detecting disease-related rare variants in sequencing studies. Numerous studies have shown that common variants can only explain a small proportion of the phenotypic variance for complex diseases. More and more evidence suggests that some of this missing heritability can be explained by rare variants. Considering the importance of rare variants, researchers have proposed a considerable number of methods for identifying the rare variants associated with complex diseases. Extensive research has been carried out on testing the association between rare variants and dichotomous, continuous or ordinal traits. So far, however, there has been little discussion about the case in which both genotypes and phenotypes are ordinal variables. This paper introduces a method based on the γ-statistic, called OV-RV, for examining disease-related rare variants when both genotypes and phenotypes are ordinal. At present, little is known about the asymptotic distribution of the γ-statistic when conducting association analyses for rare variants. One advantage of OV-RV is that it provides a robust estimation of the distribution of the γ-statistic by employing the permutation approach proposed by Fisher. We also perform extensive simulations to investigate the numerical performance of OV-RV under various model settings. The simulation results reveal that OV-RV is valid and efficient; namely, it controls the type I error approximately at the pre-specified significance level and achieves greater power at the same significance level. We also apply OV-RV for rare variant association studies of diastolic blood pressure.
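A permutation test on a γ-statistic can be sketched as follows in Python. This is a generic illustration under two assumptions: that the γ-statistic is Goodman and Kruskal's gamma, and that each individual has already been reduced to one ordinal genotype score. How OV-RV actually combines rare variants is not described in the abstract, and the function names and the toy data below are hypothetical.

```python
import random


def gamma_statistic(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D) over all pairs, ties ignored."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = concordant + discordant
    return 0.0 if total == 0 else (concordant - discordant) / total


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for gamma, shuffling the phenotypes."""
    rng = random.Random(seed)
    observed = gamma_statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(gamma_statistic(x, shuffled)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical ordinal genotype scores and ordinal phenotype levels.
genotype_score = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
phenotype_level = [0, 1, 0, 1, 1, 2, 2, 1, 2, 2]
observed, p_value = permutation_pvalue(genotype_score, phenotype_level)
print(observed, p_value)
```

Shuffling the phenotypes breaks any genotype-phenotype link while preserving both marginal distributions, so the reference distribution does not rely on an asymptotic approximation.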

GALLO: An R package for genomic annotation and integration of multiple data sources in livestock for positional candidate loci

GigaScience ◽

10.1093/gigascience/giaa149 ◽

2020 ◽

Vol 9 (12) ◽

Author(s):

Pablo A S Fonseca ◽

Aroa Suárez-Vega ◽

Gabriele Marras ◽

Ángela Cánovas

Keyword(s):

Complex Traits ◽

High Throughput Sequencing ◽

Association Studies ◽

R Package ◽

Biological Information ◽

Genome Wide Association Studies ◽

Multiple Sources ◽

Positional Candidate ◽

Genomic Annotation ◽

Candidate Loci

Abstract Background: The development of high-throughput sequencing and genotyping methodologies has enabled the identification of thousands of genomic regions associated with several complex traits. The integration of multiple sources of biological information is a crucial step required to better understand patterns regulating the development of these traits. Findings: Genomic Annotation in Livestock for positional candidate LOci (GALLO) is an R package developed for the accurate annotation of genes and quantitative trait loci (QTLs) located in regions identified in common genomic analyses performed in livestock, such as genome-wide association studies and transcriptomics using RNA sequencing. Moreover, GALLO allows the graphical visualization of gene and QTL annotation results, data comparison among different grouping factors (e.g., methods, breeds, tissues, statistical models, studies), and QTL enrichment in different livestock species such as cattle, pigs, sheep, and chickens. Conclusions: Consequently, GALLO is a useful package for annotation, identification of hidden patterns across datasets, and data mining previously reported associations, as well as the efficient examination of the genetic architecture of complex traits in livestock.

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research ◽

10.12688/f1000research.7563.2 ◽

2016 ◽

Vol 4 ◽

pp. 1521 ◽

Cited By ~ 268

Author(s):

Charlotte Soneson ◽

Michael I. Love ◽

Mark D. Robinson

Keyword(s):

Statistical Inference ◽

High Throughput Sequencing ◽

Real Data ◽

Transcript Level ◽

R Package ◽

Data Sets ◽

Rna Seq ◽

Abundance Estimates ◽

Gene Level ◽

Genomic Regions

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.

Bayesian model comparison for rare variant association studies of multiple phenotypes

10.1101/257162 ◽

2018 ◽

Cited By ~ 3

Author(s):

Christopher DeBoever ◽

Matthew Aguirre ◽

Yosuke Tanigawa ◽

Chris C. A. Spencer ◽

Timothy Poterba ◽

...

Keyword(s):

Genetic Variation ◽

Rare Variant ◽

Genetic Variants ◽

Model Comparison ◽

Rare Variants ◽

Association Studies ◽

Meta Analysis ◽

Rare Variant Association ◽

Physical Measurements ◽

Comparison Approach

Abstract Whole genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery and inference that are not addressed by the traditional one variant-one phenotype association study. Here we introduce a model comparison approach we refer to as MRP for rare variant association studies that considers correlation, scale, and location of genetic effects across a group of genetic variants, phenotypes, and studies. We consider the use of summary statistic data to apply univariate and multivariate gene-based meta-analysis models for identifying rare variant associations with an emphasis on protective protein-truncating variants that can expedite drug discovery. Through simulation studies, we demonstrate that the proposed model comparison approach can improve the ability to detect rare variant association signals. We also apply the model to two groups of phenotypes from the UK Biobank: 1) asthma diagnosis, eosinophil counts, forced expiratory volume, and forced vital capacity; and 2) glaucoma diagnosis, intra-ocular pressure, and corneal resistance factor. We are able to recover known associations such as the protective association between rs146597587 in IL33 and asthma. We also find evidence for novel protective associations between rare variants in ANGPTL7 and glaucoma. Overall, we show that the MRP model comparison approach is able to retain and improve upon useful features from widely-used meta-analysis approaches for rare variant association analyses and prioritize protective modifiers of disease risk.

Author summary: Due to the continually decreasing cost of acquiring genetic data, we are now beginning to see large collections of individuals for which we have both genetic information and trait data such as disease status, physical measurements, biomarker levels, and more. These datasets offer new opportunities to find relationships between inherited genetic variation and disease. While it is known that there are relationships between different traits, typical genetic analyses only focus on analyzing one genetic variant and one phenotype at a time. Additionally, it is difficult to identify rare genetic variants that are associated with disease due to their scarcity, even among large sample sizes. In this work, we present a method for identifying associations between genetic variation and disease that considers multiple rare variants and phenotypes at the same time. By sharing information across rare variants and phenotypes, we improve our ability to identify rare variants associated with disease compared to considering a single rare variant and a single phenotype. The method can be used to identify candidate disease genes as well as genes that might represent attractive drug targets.

Efficient and flexible Integration of variant characteristics in rare variant association studies using integrated nested Laplace approximation

PLoS Computational Biology ◽

10.1371/journal.pcbi.1007784 ◽

2021 ◽

Vol 17 (2) ◽

pp. e1007784

Author(s):

Hana Susak ◽

Laura Serra-Saurina ◽

German Demidov ◽

Raquel Rabionet ◽

Laura Domènech ◽

...

Keyword(s):

Rare Variant ◽

Rare Variants ◽

Association Studies ◽

Complex Diseases ◽

Laplace Approximation ◽

Phenotypic Variance ◽

Rare Variant Association ◽

Integrated Nested Laplace Approximation ◽

Genome Wide ◽

Whole Exome

Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 70% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests in the ‘Rare Variant Genome Wide Association Study’ (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering. Applying rvGWAS to a Chronic Lymphocytic Leukemia study we identified eight candidate predisposition genes, including EHMT2 and COPS7A.

A broken promise: microbiome differential abundance methods do not control the false discovery rate

Briefings in Bioinformatics ◽

10.1093/bib/bbx104 ◽

2017 ◽

Vol 20 (1) ◽

pp. 210-221 ◽

Cited By ~ 31

Author(s):

Stijn Hawinkel ◽

Federico Mattiello ◽

Luc Bijnens ◽

Olivier Thas

Keyword(s):

Statistical Methods ◽

High Throughput Sequencing ◽

Bacterial Species ◽

Human Microbiome ◽

Real Data ◽

Differential Abundance ◽

Sequencing Technologies ◽

Antibiotic Drugs ◽

Count Distribution ◽

Microbiome Data

Abstract High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim at detecting associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important to identify the microbiome as a prognostic or diagnostic biomarker or to demonstrate efficacy of prodrug or antibiotic drugs. Because of a lack of benchmarking studies in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real data shuffling algorithms. The results are consistent over the different approaches and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that generates correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods.
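One common way to generate correlated counts with prescribed marginal distributions is a Gaussian copula, sketched below in Python for two Poisson-distributed species. The abstract does not say which construction the authors' tool uses, so this is a substitute technique shown for illustration only; the function names, the Poisson marginals and the parameter values are assumptions.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_quantile(u, lam, max_count=100000):
    """Smallest k such that P(X <= k) >= u for X ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u rounding to exactly 1.0
    k = 0
    prob = math.exp(-lam)
    cdf = prob
    while cdf < u and k < max_count:
        k += 1
        prob *= lam / k
        cdf += prob
    return k


def correlated_poisson_pairs(n, lam_a, lam_b, rho, seed=0):
    """Draw n count pairs whose latent normal variables have correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map each normal draw to a uniform, then to a Poisson count.
        pairs.append((poisson_quantile(normal_cdf(z1), lam_a),
                      poisson_quantile(normal_cdf(z2), lam_b)))
    return pairs


pairs = correlated_poisson_pairs(2000, lam_a=5.0, lam_b=10.0, rho=0.8, seed=1)
print(pairs[:5])
```

Swapping `poisson_quantile` for the quantile function of another count distribution changes the marginals without touching the correlation step. Note that `rho` is the correlation of the latent normal variables, so the correlation of the resulting counts is somewhat lower.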
