Computationally efficient whole genome regression for quantitative and binary traits

Abstract: Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine learning method called REGENIE for fitting a whole genome regression model that is orders of magnitude faster than alternatives, while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes, and only requires local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. The method is applicable to both quantitative and binary phenotypes, including rare variant analysis of binary traits with unbalanced case-control ratios, where we introduce a fast, approximate Firth logistic regression test. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach compared to several existing methods using quantitative and binary traits from the UK Biobank dataset with up to 407,746 individuals.

Rare variant analysis of blood pressure phenotypes in the Genetic Analysis Workshop 18 whole genome sequencing data using sequence kernel association test

BMC Proceedings ◽

10.1186/1753-6561-8-s1-s10 ◽

2014 ◽

Vol 8 (S1) ◽

Cited By ~ 3

Author(s):

Cates Mallaney ◽

Yun Ju Sung

Keyword(s):

Blood Pressure ◽

Genetic Analysis ◽

Genetic Analysis Workshop ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequence Kernel Association Test ◽

Sequencing Data ◽

Rare Variant Analysis ◽

Variant Analysis

A two-step approach to testing overall effect of gene-environment interaction for multiple phenotypes

10.1101/2020.07.06.190256 ◽

2020 ◽

Author(s):

Arunabha Majumdar ◽

Kathryn S. Burch ◽

Sriram Sankararaman ◽

Bogdan Pasaniuc ◽

W. James Gauderman ◽

...

Keyword(s):

Genetic Effect ◽

Single Step ◽

Environment Interaction ◽

Step Method ◽

Gene Environment Interaction ◽

GxE Interaction ◽

Multiple Phenotypes ◽

Gene Environment ◽

Genome Wide ◽

The UK

Abstract: While gene-environment (GxE) interactions contribute importantly to many different phenotypes, detecting such interactions requires well-powered studies and has proven difficult. To address this, we combine two approaches to improve GxE power: simultaneously evaluating multiple phenotypes and using a two-step analysis approach. Previous work shows that the power to identify a main genetic effect can be improved by simultaneously analyzing multiple related phenotypes. For a univariate phenotype, two-step methods produce higher power for detecting a GxE interaction than single-step analysis. Therefore, we propose a two-step approach to test for an overall GxE effect for multiple phenotypes. Using simulations, we demonstrate that, when more than one phenotype has a GxE effect (i.e., GxE pleiotropy), our approach offers a substantial gain in power (18% to 43%) to detect an aggregate-level GxE effect for a multivariate phenotype compared to an analogous two-step method to identify a GxE effect for a univariate phenotype. We applied the proposed approach to simultaneously analyze three lipids (LDL, HDL, and triglycerides) with the frequency of alcohol consumption as the environmental factor in the UK Biobank. The method identified two independent genome-wide significant signals of an overall GxE effect on the vector of lipids.

Whole genome sequencing and rare variant analysis in essential tremor families

PLoS ONE ◽

10.1371/journal.pone.0220512 ◽

2019 ◽

Vol 14 (8) ◽

pp. e0220512 ◽

Cited By ~ 6

Author(s):

Zagaa Odgerel ◽

Shilpa Sonti ◽

Nora Hernandez ◽

Jemin Park ◽

Ruth Ottman ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Essential Tremor ◽

Genome Sequencing ◽

Rare Variant ◽

Whole Genome ◽

Rare Variant Analysis ◽

Variant Analysis

Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts

Nature Communications ◽

10.1038/s41467-020-14288-y ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 11

Author(s):

Elizabeth T. Cirulli ◽

Simon White ◽

Robert W. Read ◽

Gai Elhanan ◽

William J. Metcalf ◽

...

Keyword(s):

Rare Variant ◽

Rare Variant Analysis ◽

Genome Wide ◽

Variant Analysis

Cross-trait analyses with migraine reveal widespread pleiotropy and suggest a vascular component to migraine headache

International Journal of Epidemiology ◽

10.1093/ije/dyaa050 ◽

2020 ◽

Vol 49 (3) ◽

pp. 1022-1031 ◽

Cited By ~ 4

Author(s):

Katherine M Siewert ◽

Derek Klarin ◽

Scott M Damrauer ◽

Kyong-Mi Chang ◽

Philip S Tsao ◽

...

Keyword(s):

Blood Pressure ◽

Correlation Analysis ◽

Genetic Correlation ◽

Migraine Headache ◽

Genome Wide Association Study ◽

Multiple Phenotypes ◽

Genome Wide ◽

Vascular Component ◽

The UK

Abstract

Background: Nearly a fifth of the world’s population suffer from migraine headache, yet risk factors for this disease are poorly characterized.

Methods: To further elucidate these factors, we conducted a genetic correlation analysis using cross-trait linkage disequilibrium (LD) score regression between migraine headache and 47 traits from the UK Biobank. We then tested for possible causality between these phenotypes and migraine, using Mendelian randomization. In addition, we attempted replication of our findings in an independent genome-wide association study (GWAS) when available.

Results: We report multiple phenotypes with genetic correlation (P < 1.06 × 10⁻³) with migraine, including heart disease, type 2 diabetes, lipid levels, blood pressure, autoimmune and psychiatric phenotypes. In particular, we find evidence that blood pressure directly contributes to migraine and explains a previously suggested causal relationship between calcium and migraine.

Conclusions: This is the largest genetic correlation analysis of migraine headache to date, both in terms of migraine GWAS sample size and the number of phenotypes tested. We find that migraine has a shared genetic basis with a large number of traits, indicating pervasive pleiotropy at migraine-associated loci.
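
The abstract does not say how the significance threshold for the genetic correlations was chosen. As a hedged illustration only, the reported value is numerically consistent with a Bonferroni correction of a 0.05 significance level across the 47 traits tested. Both the 0.05 level and the choice of Bonferroni correction are assumptions made here, not statements from the source.

```python
# Assumption: a family-wise significance level of 0.05, Bonferroni-corrected
# for the 47 traits named in the abstract. Neither choice is stated in the source.
alpha = 0.05
n_traits = 47

threshold = alpha / n_traits
print(f"{threshold:.2e}")  # 1.06e-03
```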

Erratum: Genome-wide common and rare variant analysis provides novel insights into clozapine-associated neutropenia

Molecular Psychiatry ◽

10.1038/mp.2016.137 ◽

2016 ◽

Vol 22 (10) ◽

pp. 1509-1509 ◽

Cited By ~ 6

Author(s):

S E Legge ◽

M L Hamshere ◽

S Ripke ◽

A F Pardinas ◽

...

Keyword(s):

Rare Variant ◽

Rare Variant Analysis ◽

Genome Wide ◽

Variant Analysis

Comprehensive Rare Variant Analysis via Whole-Genome Sequencing to Determine the Molecular Pathology of Inherited Retinal Disease

The American Journal of Human Genetics ◽

10.1016/j.ajhg.2016.12.003 ◽

2017 ◽

Vol 100 (1) ◽

pp. 75-90 ◽

Cited By ~ 169

Author(s):

Keren J. Carss ◽

Gavin Arno ◽

Marie Erwood ◽

Jonathan Stephens ◽

Alba Sanchis-Juan ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variant ◽

Molecular Pathology ◽

Retinal Disease ◽

Whole Genome ◽

Rare Variant Analysis ◽

Inherited Retinal Disease ◽

Variant Analysis

ICGRM: integrative construction of genomic relationship matrix combining multiple genomic regions for big dataset

BMC Bioinformatics ◽

10.1186/s12859-019-3319-y ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Dan Jiang ◽

Cong Xin ◽

Jinhua Ye ◽

Yingbo Yuan ◽

Ming Fang

Keyword(s):

Disease Risk ◽

Genomic Relationship Matrix ◽

Relationship Matrix ◽

Computer Memory ◽

Whole Genome ◽

Summary Statistics ◽

Genomic Relationship ◽

Computationally Efficient ◽

Computer Clusters ◽

Genome Wide

Abstract

Background: Genomic prediction is an advanced method for estimating genetic values and is widely accepted for genetic evaluation in animals and disease-risk prediction in humans. It estimates genetic values from genome-wide SNPs instead of pedigree. The key step is constructing the genomic relationship matrix (GRM) from genome-wide SNPs; however, computing the GRM usually requires a large amount of memory, especially when the number of SNPs and the sample size are large, so it can become computationally prohibitive even for supercomputer clusters. We developed an integrative algorithm to compute the GRM. To avoid calculating the GRM for the whole genome at once, ICGRM divides the genome-wide SNPs into several segments and computes GRM-related summary statistics for each segment, which requires little RAM; it then integrates these summary statistics to produce the GRM for the whole genome.

Results: In our simulation dataset, the memory used by ICGRM was reduced by 15 times (from 218Gb to 14Gb) after the genome SNPs were split into 5 to 200 parts according to the number of SNPs, making the computation feasible on almost all kinds of computer servers. ICGRM is implemented in C/C++ and freely available via https://github.com/mingfang618/CLGRM.

Conclusions: ICGRM is computationally efficient software for building the GRM and can be used for big datasets.
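
The segment-then-integrate idea can be sketched in a few lines. This is a minimal illustration, not ICGRM's actual code. It assumes the common standardized-genotype definition of the GRM (each SNP centered by twice its allele frequency and scaled by its expected standard deviation, then averaged over SNPs), and it assumes the per-segment summary statistics are the accumulated cross-product matrix and the SNP count. The abstract does not specify either choice. The function names `segment_summary` and `grm_from_segments` are hypothetical, and plain Python lists are used only to keep the sketch dependency-free.

```python
from math import sqrt

def segment_summary(snps):
    """Summarize one segment: (sum over SNPs of z z^T, number of SNPs used).

    `snps` is a list of SNP columns; each column holds allele counts (0/1/2),
    one entry per individual.
    """
    n = len(snps[0])
    cross = [[0.0] * n for _ in range(n)]
    used = 0
    for col in snps:
        p = sum(col) / (2 * n)               # allele frequency
        sd = sqrt(2 * p * (1 - p))
        if sd == 0:                          # skip monomorphic SNPs
            continue
        z = [(g - 2 * p) / sd for g in col]  # standardized genotypes
        for i in range(n):
            for j in range(n):
                cross[i][j] += z[i] * z[j]
        used += 1
    return cross, used

def grm_from_segments(segments):
    """Integrate per-segment summaries; only one segment is processed at a time."""
    total, m = None, 0
    for snps in segments:
        cross, used = segment_summary(snps)
        if total is None:
            total = cross
        else:
            for i, row in enumerate(cross):
                for j, value in enumerate(row):
                    total[i][j] += value
        m += used
    return [[value / m for value in row] for row in total]

# Toy data: 5 SNPs genotyped in 4 individuals.
snps = [[0, 1, 2, 1], [2, 2, 0, 1], [1, 0, 0, 1], [0, 2, 1, 1], [1, 1, 2, 0]]
whole = grm_from_segments([snps])                 # one segment
split = grm_from_segments([snps[:2], snps[2:]])   # two segments
max_diff = max(abs(a - b) for ra, rb in zip(whole, split) for a, b in zip(ra, rb))
print(max_diff < 1e-9)
```

Because the cross-products are additive over SNPs, splitting the genome changes only how much genotype data must be held at once, not the resulting matrix. Passing a generator that reads each segment from disk would keep peak genotype memory at one segment.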

STAAR Workflow: A cloud-based workflow for scalable and reproducible rare variant analysis

10.1101/2021.09.07.456116 ◽

2021 ◽

Author(s):

Sheila M. Gaynor ◽

Kenneth E. Westerman ◽

Lea L. Ackovic ◽

Xihao Li ◽

Zilin Li ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variant ◽

Rare Variants ◽

Association Studies ◽

Whole Genome ◽

Sequencing Analysis ◽

Functional Annotations ◽

Rare Variant Analysis ◽

Variant Analysis

Abstract

Summary: We developed the STAAR WDL workflow to facilitate the analysis of rare variants in whole genome sequencing association studies. The open-access STAAR workflow written in the workflow description language (WDL) allows a user to perform rare variant testing for both gene-centric and genetic region approaches, enabling genome-wide, candidate, and conditional analyses. It incorporates functional annotations into the workflow as introduced in the STAAR method in order to boost the rare variant analysis power. This tool was specifically developed and optimized to be implemented on cloud-based platforms such as BioData Catalyst Powered by Terra. It provides easy-to-use functionality for rare variant analysis that can be incorporated into an exhaustive whole genome sequencing analysis pipeline.

Availability and implementation: The workflow is freely available from https://dockstore.org/workflows/github.com/sheilagaynor/STAAR_workflow.

A fast and accurate method for detection of IBD shared haplotypes in genome-wide SNP data

10.1101/042879 ◽

2016 ◽

Author(s):

Douglas W. Bjelland ◽

Uday Lingala ◽

Piyush Patel ◽

Matt Jones ◽

Matthew C. Keller

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

False Negative ◽

Accurate Method ◽

Whole Genome ◽

Computationally Efficient ◽

SNP Data ◽

Genome Wide ◽

Detection Program ◽

High Computational Efficiency

Identical by descent (IBD) segments are used to understand a number of fundamental issues in genetics. IBD segments are typically detected using long stretches of identical alleles between haplotypes in whole-genome SNP data. Phase or SNP call errors in genomic data can degrade accuracy of IBD detection and lead to false positive calls, false negative calls, and under- or overextension of true IBD segments. Furthermore, the number of comparisons increases quadratically with sample size, requiring high computational efficiency. We developed a new IBD segment detection program, FISHR (Find IBD Shared Haplotypes Rapidly), in an attempt to accurately detect IBD segments and to better estimate their endpoints using an algorithm that is fast enough to be deployed on the very large whole-genome SNP datasets. We compared the performance of FISHR to three leading IBD segment detection programs: GERMLINE, refinedIBD, and HaploScore. Using simulated and real genomic sequence data, we show that FISHR is slightly more accurate than all programs at detecting long (greater than 3 cM) IBD segments but slightly less accurate than refinedIBD at detecting short (1 cM) IBD segments. Moreover, FISHR outperforms all programs in determining the true endpoints of IBD segments, which is important for several reasons. FISHR takes two to four times longer than GERMLINE to run, whereas both GERMLINE and FISHR were orders of magnitude faster than refinedIBD and HaploScore. Overall, FISHR provides accurate IBD detection in unrelated individuals and is computationally efficient enough to be utilized on large SNP datasets greater than 20,000 individuals.
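
To make the quadratic growth concrete, the sketch below counts unordered pairs of individuals, n(n - 1)/2, for a few sample sizes. Treating one "comparison" as one pair of individuals is an assumption for illustration; the abstract does not say whether FISHR counts pairs of individuals or pairs of haplotypes. The function name `pairwise_comparisons` is hypothetical.

```python
from math import comb

def pairwise_comparisons(n_individuals: int) -> int:
    """Number of unordered pairs among n individuals: n * (n - 1) / 2."""
    return comb(n_individuals, 2)

for n in (1_000, 10_000, 20_000):
    print(f"{n:>6} individuals -> {pairwise_comparisons(n):>11,} pairs")
```

Under this assumption, doubling the sample from 10,000 to 20,000 individuals roughly quadruples the number of pairs, from 49,995,000 to 199,990,000.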
