Biobank-scale inference of ancestral recombination graphs enables genealogy-based mixed model association of complex traits

Accurate inference of gene genealogies from genetic data has the potential to facilitate a wide range of analyses. We introduce a method for accurately inferring biobank-scale genome-wide genealogies from sequencing or genotyping array data, as well as strategies to utilize genealogies within linear mixed models to perform association and other complex trait analyses. We use these new methods to build genome-wide genealogies using genotyping data for 337,464 UK Biobank individuals and to detect associations in 7 complex traits. Genealogy-based association detects more rare and ultra-rare signals (N = 133, frequency range 0.0004% - 0.1%) than genotype imputation from ~65,000 sequenced haplotypes (N = 65). In a subset of 138,039 exome sequencing samples, these associations strongly tag (average r = 0.72) underlying sequencing variants, which are enriched for missense (2.3×) and loss-of-function (4.5×) variation. Inferred genealogies also capture additional association signals in higher frequency variants. These results demonstrate that large-scale inference of gene genealogies may be leveraged in the analysis of complex traits, complementing approaches that require the availability of large, population-specific sequencing panels.


RICOPILI: Rapid Imputation for COnsortias PIpeLIne

10.1101/587196 ◽

2019 ◽

Cited By ~ 7

Author(s):

Max Lam ◽

Swapnil Awasthi ◽

Hunna J. Watson ◽

Jackie Goldstein ◽

Georgia Panagiotaropoulou ◽

...

Keyword(s):

Quality Control ◽

Complex Traits ◽

High Performance ◽

Large Scale ◽

Genome Wide Association Study ◽

Meta Analysis ◽

Supplementary Information ◽

Manuscript Preparation ◽

Genome Wide ◽

Wide Range

Abstract. Motivation: Genome-wide association study (GWAS) analyses, at sufficient sample sizes and power, have successfully revealed biological insights for several complex traits. RICOPILI, an open-source Perl-based pipeline, was developed to address the challenges of rapidly processing large-scale multi-cohort GWAS studies, including quality control, imputation and downstream analyses. The pipeline is computationally efficient with portability to a wide range of high-performance computing (HPC) environments. Summary: RICOPILI was created as the Psychiatric Genomics Consortium (PGC) pipeline for GWAS and has been adopted by other users. The pipeline features i) technical and genomic quality control in case-control and trio cohorts, ii) genome-wide phasing and imputation, iv) association analysis, v) meta-analysis, vi) polygenic risk scoring and vii) replication analysis. Notably, as a major differentiator from other GWAS pipelines, RICOPILI leverages automated parallelization and cluster job management approaches for rapid production of imputed genome-wide data. A comprehensive meta-analysis of simulated GWAS data has been incorporated, demonstrating each step of the pipeline. This includes all of the associated visualization plots, to allow ease of data interpretation and manuscript preparation. Simulated GWAS datasets are also packaged with the pipeline for user training tutorials and developer work. Availability and Implementation: RICOPILI has a flexible architecture to allow for ongoing development and incorporation of newer available algorithms and is adaptable to various HPC environments (QSUB, BSUB, SLURM and others). Specific links for genomic resources are either directly provided in this paper or via tutorials and external links. The central location hosting scripts and tutorials is found at this URL: https://sites.google.com/a/broadinstitute.org/RICOPILI/ [email protected] Supplementary information: Supplementary data are available.


RICOPILI: Rapid Imputation for COnsortias PIpeLIne

Bioinformatics ◽

10.1093/bioinformatics/btz633 ◽

2019 ◽

Cited By ~ 8

Author(s):

Max Lam ◽

Swapnil Awasthi ◽

Hunna J Watson ◽

Jackie Goldstein ◽

Georgia Panagiotaropoulou ◽

...

Keyword(s):

Complex Traits ◽

High Performance ◽

Large Scale ◽

Genome Wide Association Study ◽

Meta Analysis ◽

Data Interpretation ◽

Supplementary Information ◽

Manuscript Preparation ◽

Genome Wide ◽

Wide Range

Abstract. Summary: Genome-wide association study (GWAS) analyses, at sufficient sample sizes and power, have successfully revealed biological insights for several complex traits. RICOPILI, an open-source Perl-based pipeline, was developed to address the challenges of rapidly processing large-scale multi-cohort GWAS studies, including quality control (QC), imputation and downstream analyses. The pipeline is computationally efficient with portability to a wide range of high-performance computing environments. RICOPILI was created as the Psychiatric Genomics Consortium pipeline for GWAS and has been adopted by other users. The pipeline features (i) technical and genomic QC in case-control and trio cohorts, (ii) genome-wide phasing and imputation, (iv) association analysis, (v) meta-analysis, (vi) polygenic risk scoring and (vii) replication analysis. Notably, as a major differentiator from other GWAS pipelines, RICOPILI leverages automated parallelization and cluster job management approaches for rapid production of imputed genome-wide data. A comprehensive meta-analysis of simulated GWAS data has been incorporated, demonstrating each step of the pipeline. This includes all the associated visualization plots, to allow ease of data interpretation and manuscript preparation. Simulated GWAS datasets are also packaged with the pipeline for user training tutorials and developer work. Availability and implementation: RICOPILI has a flexible architecture to allow for ongoing development and incorporation of newer available algorithms and is adaptable to various HPC environments (QSUB, BSUB, SLURM and others). Specific links for genomic resources are either directly provided in this paper or via tutorials and external links. The central location hosting scripts and tutorials is found at this URL: https://sites.google.com/a/broadinstitute.org/RICOPILI/home Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Better estimation of SNP heritability from summary statistics provides a new understanding of the genetic architecture of complex traits

10.1101/284976 ◽

2018 ◽

Cited By ~ 6

Author(s):

Doug Speed ◽

David J Balding

Keyword(s):

Complex Traits ◽

Genetic Architecture ◽

Large Scale ◽

Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Confounding Bias ◽

Conserved Regions ◽

Genome Wide ◽

Variation Explained

LD Score Regression (LDSC) has been widely applied to the results of genome-wide association studies. However, its estimates of SNP heritability are derived from an unrealistic model in which each SNP is expected to contribute equal heritability. As a consequence, LDSC tends to over-estimate confounding bias, under-estimate the total phenotypic variation explained by SNPs, and provide misleading estimates of the heritability enrichment of SNP categories. Therefore, we present SumHer, software for estimating SNP heritability from summary statistics using more realistic heritability models. After demonstrating its superiority over LDSC, we apply SumHer to the results of 24 large-scale association studies (average sample size 121 000). First, we show that these studies have tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci has been under-reported by about 20%. Next, we estimate enrichment for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further twelve categories with above 2-fold enrichment. By contrast, our analysis using SumHer finds that conserved regions are only 1.6-fold (SD 0.06) enriched, and that no category has enrichment above 1.7-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
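
For orientation, the equal-contribution model that the abstract attributes to LDSC is usually written as the regression below. The notation is an assumption added here and does not come from the abstract: N is the sample size, M the number of SNPs, h^2 the SNP heritability, a a term absorbing confounding, and \ell_j the LD score of SNP j.

```latex
\mathbb{E}\left[\chi^2_j\right] = 1 + N a + \frac{N h^2}{M}\,\ell_j,
\qquad \ell_j = \sum_{k} r_{jk}^2
```

Under this form every SNP is expected to carry heritability h^2 / M, and the fitted intercept 1 + N a is read as confounding. This is the assumption that the abstract says SumHer relaxes.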


GWAS-Flow: A GPU accelerated framework for efficient permutation based genome-wide association studies

10.1101/783100 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jan A. Freudenthal ◽

Markus J. Ankenbrand ◽

Dominik G. Grimm ◽

Arthur Korte

Keyword(s):

Complex Traits ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Large Datasets ◽

Genome Wide Association ◽

Small Data ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Non Gaussian

Abstract. Motivation: Genome-wide association studies (GWAS) are one of the most commonly used methods to detect associations between complex traits and genomic polymorphisms. As both genotyping and phenotyping of large populations have become easier, typical modern GWAS have to cope with massive amounts of data. Thus, the computational demand for these analyses has grown remarkably during the last decades. This is especially true if one wants to implement permutation-based significance thresholds instead of using the naïve Bonferroni threshold. Permutation-based methods have the advantage of providing an adjusted multiple hypothesis correction threshold that takes the underlying phenotypic distribution into account and thus removes the need to find the correct transformation for non-Gaussian phenotypes. To enable efficient analyses of large datasets and the possibility to compute permutation-based significance thresholds, we used the machine learning framework TensorFlow to develop a linear mixed model (GWAS-Flow) that can make use of the available CPU or GPU infrastructure to decrease the time of the analyses, especially for large datasets. Results: We were able to show that our application GWAS-Flow outperforms custom GWAS scripts in terms of speed without losing accuracy. Apart from p-values, GWAS-Flow also computes summary statistics, such as the effect size and its standard error for each individual marker. The CPU-based version is the default choice for small data, while the GPU-based version of GWAS-Flow is especially suited for the analyses of big data. Availability: GWAS-Flow is freely available on GitHub (https://github.com/Joyvalley/GWAS_Flow) and is released under the terms of the MIT-License.
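
The idea of a permutation-based significance threshold can be illustrated with a small NumPy sketch. It is not GWAS-Flow's implementation: it swaps the linear mixed model for a plain per-marker Pearson correlation, runs on simulated data, and all function and variable names are hypothetical. The phenotype is shuffled repeatedly, the strongest marker statistic is recorded each time, and the (1 - alpha) quantile of those maxima serves as the genome-wide threshold.

```python
import numpy as np

def max_abs_correlation(genotypes, phenotype):
    """Largest |Pearson r| between the phenotype and any marker column."""
    g = genotypes - genotypes.mean(axis=0)   # centre each marker
    y = phenotype - phenotype.mean()         # centre the phenotype
    denom = np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    r = (g.T @ y) / denom
    return np.abs(r).max()

def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the max statistic under shuffled phenotypes."""
    rng = np.random.default_rng(seed)
    null_max = np.array([
        max_abs_correlation(genotypes, rng.permutation(phenotype))
        for _ in range(n_perm)
    ])
    return np.quantile(null_max, 1 - alpha)

# Simulated toy data (assumption): 100 individuals, 50 markers coded 0/1/2,
# and a skewed, non-Gaussian phenotype unrelated to the markers.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(100, 50)).astype(float)
phenotype = rng.exponential(size=100)

threshold = permutation_threshold(genotypes, phenotype)
observed = max_abs_correlation(genotypes, phenotype)
print(f"threshold on |r|: {threshold:.3f}, observed max |r|: {observed:.3f}")
```

Because the null distribution is built from the phenotype values themselves, no transformation of the skewed phenotype is needed before choosing the threshold, which is the advantage the abstract describes.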


Animal-ImputeDB: a comprehensive database with multiple animal reference panels for genotype imputation

Nucleic Acids Research ◽

10.1093/nar/gkz854 ◽

2019 ◽

Vol 48 (D1) ◽

pp. D659-D667 ◽

Cited By ~ 2

Author(s):

Wenqian Yang ◽

Yanbo Yang ◽

Cecheng Zhao ◽

Kun Yang ◽

Dongyang Wang ◽

...

Keyword(s):

Large Scale ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

High Quality ◽

Single Nucleotide ◽

Genome Wide ◽

Whole Genome Resequencing ◽

Missing Genotypes

Abstract: Animal-ImputeDB (http://gong_lab.hzau.edu.cn/Animal_ImputeDB/) is a public database with genomic reference panels of 13 animal species for online genotype imputation, genetic variant search, and free download. Genotype imputation is a process of estimating missing genotypes in terms of the haplotypes and genotypes in a reference panel. It can effectively increase the density of single nucleotide polymorphisms (SNPs) and thus can be widely used in large-scale genome-wide association studies (GWASs) using relatively inexpensive and low-density SNP arrays. However, most animals except humans lack high-quality reference panels, which greatly limits the application of genotype imputation in animals. To overcome this limitation, we developed Animal-ImputeDB, which is dedicated to collecting genotype data and whole-genome resequencing data of nonhuman animals from various studies and databases. A computational pipeline was developed to process different types of raw data to construct reference panels. Finally, 13 high-quality reference panels including ∼400 million SNPs from 2265 samples were constructed. In Animal-ImputeDB, an easy-to-use online tool consisting of two popular imputation tools was designed for the purpose of genotype imputation. Collectively, Animal-ImputeDB serves as an important resource for animal genotype imputation and will greatly facilitate research on animal genomic selection and genetic improvement.


Phenotypic Characterization of Milk Yield and Quality Traits in a Large Population of Water Buffaloes

Animals ◽

10.3390/ani10020327 ◽

2020 ◽

Vol 10 (2) ◽

pp. 327 ◽

Cited By ~ 2

Author(s):

Angela Costa ◽

Riccardo Negrini ◽

Massimo De Marchi ◽

Giuseppe Campanile ◽

Gianluca Neglia

Keyword(s):

Milk Yield ◽

Milk Fat ◽

Large Scale ◽

Mixed Model ◽

Large Population ◽

Solid Content ◽

Current Status ◽

Phenotypic Characterization ◽

Italian Population ◽

Milk Traits

The buffalo milk industry has economic and social relevance in Italy, as it is linked to the manufacture of traditional dairy products. To provide an overview of the current status of buffaloes’ performances on a large scale, almost 1 million milk test-day records from 72,294 buffaloes were available to investigate milk yield, energy corrected milk, fat, protein, and lactose content, and somatic cell score (SCS). Phenotypic correlations between milk traits were calculated and analysis of variance was carried out through a mixed model approach including the fixed effects of parity, stage of lactation, sampling time, month of calving, and all their interactions, and the random effects of buffalo, herd-test-date, and residual. Third-parity buffaloes were the most productive in terms of milk yield, while the lowest solid content was detected in sixth-parity buffaloes. A considerable gap between primiparous and multiparous buffaloes was observed for milk yield, especially in early- and mid-lactation. Overall, SCS progressively increased with parity and showed a negative correlation with milk yield in both primiparous (−0.12) and multiparous (−0.14) buffaloes. Results suggested that, at the industrial level, milk of primiparous buffaloes may be preferred for transformation purposes, since it was characterized by greater solid content and lower SCS. Results of this study provide a picture of the Italian population of buffaloes under systematic performance records and might be beneficial to both the dairy industry and breeding organizations.


Characterization of expression quantitative trait loci in extensively phenotyped pedigrees ascertained for bipolar disorder

10.1101/031427 ◽

2015 ◽

Author(s):

Christine Peterson ◽

Susan Service ◽

Anna Jasinska ◽

Fuying Gao ◽

Ivette Zelaya ◽

...

Keyword(s):

Gene Expression ◽

Bipolar Disorder ◽

Quantitative Trait Loci ◽

Quantitative Trait ◽

Complex Traits ◽

Expression Quantitative Trait Loci ◽

Genome Wide ◽

Wide Range ◽

Trait Loci

The observation that variants regulating gene expression (expression quantitative trait loci, eQTL) are at a high frequency among SNPs associated with complex traits has made the genome-wide characterization of gene expression an important tool in genetic mapping studies of such traits. As part of a study to identify genetic loci contributing to bipolar disorder and a wide range of BP-related quantitative traits in members of 26 pedigrees from Costa Rica and Colombia, we measured gene expression in lymphoblastoid cell lines derived from 786 pedigree members. The study design enabled us to comprehensively reconstruct the genetic regulatory network in these families, provide estimates of heritability, identify eQTL, evaluate missing heritability for the eQTL, and quantify the number of different alleles contributing to any given locus.


Ancestral contributions to contemporary European complex traits

10.1101/2021.08.03.454888 ◽

2021 ◽

Author(s):

Davide Marnetto ◽

Vasili Pankratov ◽

Mayukh Mondal ◽

Francesco Montinaro ◽

Katri Pärna ◽

...

Keyword(s):

Bronze Age ◽

Complex Traits ◽

Phenotypic Variability ◽

Complex Trait ◽

Blood Cholesterol ◽

Hunter Gatherers ◽

Genetic Components ◽

Genome Wide ◽

A Genome ◽

Genomic Regions

The contemporary European genetic makeup formed in the last 8000 years as the combination of three main genetic components: the local Western Hunter-Gatherers, the incoming Neolithic Farmers from Anatolia and the Bronze Age component from the Pontic Steppes. On meeting in the post-Neolithic European environment, the genetic variants accumulated during these three distinct evolutionary histories mixed and came into contact with new environmental challenges. Here we investigate how this genetic legacy reflects on the complex trait landscape of contemporary European populations, using the Estonian Biobank as a case study. For the first time we directly connect the phenotypic information available from biobank samples with the genetic similarity to these ancestral groups, both at a genome-wide level and focusing on genomic regions associated with each of the 27 complex traits we investigated. We also found that SNPs connected to pigmentation, cholesterol, sleep, diastolic blood pressure, and body mass index (BMI) show signals of selection following the post-Neolithic admixture events. We recapitulate existing knowledge about pigmentation traits, corroborate the connection between Steppe ancestry and height and highlight novel associations. Among others, we report the contribution of Hunter-Gatherer ancestry towards high BMI and low blood cholesterol levels. Our results show that the ancient components that form the contemporary European genome were differentiated enough to contribute ancestry-specific signatures to the phenotypic variability displayed by contemporary individuals in at least 11 out of 27 of the complex traits investigated here.


Investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis

10.1101/808295 ◽

2019 ◽

Cited By ~ 2

Author(s):

Yuhua Zhang ◽

Corbin Quick ◽

Ketian Yu ◽

Alvaro Barbeira ◽

Francesca Luca ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Large Scale ◽

Molecular Mechanisms ◽

Association Studies ◽

Complex Trait ◽

Causal Effects ◽

Biological Mechanisms ◽

Integrative Framework ◽

Eqtl Data

Abstract: Transcriptome-wide association studies (TWAS), an integrative framework using expression quantitative trait loci (eQTLs) to construct proxies for gene expression, have emerged as a promising method to investigate the biological mechanisms underlying associations between genotypes and complex traits. However, challenges remain in interpreting TWAS results, especially regarding their causality implications. In this paper, we describe a new computational framework, probabilistic TWAS (PTWAS), to detect associations and investigate causal relationships between gene expression and complex traits. We use established concepts and principles from instrumental variables (IV) analysis to delineate and address the unique challenges that arise in TWAS. PTWAS utilizes probabilistic eQTL annotations derived from multi-variant Bayesian fine-mapping analysis, conferring higher power to detect TWAS associations than existing methods. Additionally, PTWAS provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type specific causal effects of gene expression on complex traits. These features make PTWAS uniquely suited for in-depth investigations of the biological mechanisms that contribute to complex trait variation. Using eQTL data across 49 tissues from GTEx v8, we apply PTWAS to analyze 114 complex traits using GWAS summary statistics from several large-scale projects, including the UK Biobank. Our analysis reveals an abundance of genes with strong evidence of eQTL-mediated causal effects on complex traits and highlights the heterogeneity and tissue-relevance of these effects across complex traits. We distribute software and eQTL annotations to enable users to perform rigorous TWAS analysis by leveraging the full potential of the latest GTEx multi-tissue eQTL data.


Disease heritability enrichment of regulatory elements is concentrated in elements with ancient sequence age and conserved function across species

10.1101/420166 ◽

2018 ◽

Author(s):

Margaux L.A. Hujoel ◽

Steven Gazal ◽

Farhad Hormozdiari ◽

Bryce van de Geijn ◽

Alkes L. Price

Keyword(s):

Complex Traits ◽

Negative Selection ◽

Target Gene ◽

Regulatory Element ◽

Complex Trait ◽

Regulatory Elements ◽

Mean Value ◽

Regulatory Function ◽

Loss Of Function ◽

The Mean

Abstract: Regulatory elements, e.g. enhancers and promoters, have been widely reported to be enriched for disease and complex trait heritability. We investigated how this enrichment varies with the age of the underlying genome sequence, the conservation of regulatory function across species, and the target gene of the regulatory element. We estimated heritability enrichment by applying stratified LD score regression to summary statistics from 41 independent diseases and complex traits (average N = 320K) and meta-analyzing results across traits. Enrichment of human enhancers and promoters was larger in elements with older sequence age, assessed via alignment with other species irrespective of conserved functionality: enhancer elements with ancient sequence age (older than the split between marsupial and placental mammals) were 8.8x enriched (vs. 2.5x for all enhancers; p = 3e-14), and promoter elements with ancient sequence age were 13.5x enriched (vs. 5.1x for all promoters; p = 5e-16). Enrichment of human enhancers and promoters was also larger in elements whose regulatory function was conserved across species, e.g. human enhancers that were enhancers in ≥5 of 9 other mammals were 4.6x enriched (p = 5e-12 vs. all enhancers). Enrichment of human promoters was larger in promoters of loss-of-function intolerant genes: 12.0x enrichment (p = 8e-15 vs. all promoters). The mean value of several measures of negative selection within these genomic annotations mirrored all of these findings. Notably, the annotations with these excess heritability enrichments were jointly significant conditional on each other and on our baseline-LD model, which includes a broad set of coding, conserved, regulatory and LD-related annotations.
