Computationally efficient whole genome regression for quantitative and binary traits

Author(s):  
Joelle Mbatchou ◽  
Leland Barnard ◽  
Joshua Backman ◽  
Anthony Marcketta ◽  
Jack A. Kosmicki ◽  
...  

Abstract Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine learning method called REGENIE for fitting a whole genome regression model that is orders of magnitude faster than alternatives, while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes, and only requires local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. The method is applicable to both quantitative and binary phenotypes, including rare variant analysis of binary traits with unbalanced case-control ratios where we introduce a fast, approximate Firth logistic regression test. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach compared to several existing methods using quantitative and binary traits from the UK Biobank dataset with up to 407,746 individuals.
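The idea of fitting a whole genome regression while holding only local segments of the genotype matrix in memory can be illustrated with a two-level ridge regression. This is a minimal sketch under stated assumptions, not the REGENIE implementation: the abstract does not describe the algorithm, so the function names, the block size, the penalty values, and the simulated data are all hypothetical, and the sketch omits cross-validation, covariates, and binary traits.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def blockwise_predictors(genotypes, y, block_size, lam):
    """Level 0 (assumed): one ridge fit per block of adjacent variants.

    Only the current block is sliced out at each step, which mimics
    reading a local segment of the genotype matrix from disk.
    """
    n_variants = genotypes.shape[1]
    preds = []
    for start in range(0, n_variants, block_size):
        block = genotypes[:, start:start + block_size]
        beta = ridge_fit(block, y, lam)
        preds.append(block @ beta)
    return np.column_stack(preds)

def whole_genome_prediction(genotypes, y, block_size=50, lam0=10.0, lam1=1.0):
    """Level 1 (assumed): combine the block predictors with a second ridge fit."""
    W = blockwise_predictors(genotypes, y, block_size, lam0)
    gamma = ridge_fit(W, y, lam1)
    return W @ gamma

# Simulated stand-in for standardized genotypes and a quantitative trait.
rng = np.random.default_rng(0)
n, m = 200, 300
G = rng.standard_normal((n, m))
effects = np.zeros(m)
effects[[5, 120, 250]] = 1.0          # three hypothetical causal variants
y = G @ effects + 0.5 * rng.standard_normal(n)
y = y - y.mean()

pred = whole_genome_prediction(G, y)
```

Because each block is fitted independently, the level 0 step could also be distributed across workers, with only the small matrix of block predictors gathered for the final fit.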

2020 ◽  
Author(s):  
Arunabha Majumdar ◽  
Kathryn S. Burch ◽  
Sriram Sankararaman ◽  
Bogdan Pasaniuc ◽  
W. James Gauderman ◽  
...  

Abstract While gene-environment (GxE) interactions contribute importantly to many different phenotypes, detecting such interactions requires well-powered studies and has proven difficult. To address this, we combine two approaches to improve GxE power: simultaneously evaluating multiple phenotypes and using a two-step analysis approach. Previous work shows that the power to identify a main genetic effect can be improved by simultaneously analyzing multiple related phenotypes. For a univariate phenotype, two-step methods produce higher power for detecting a GxE interaction compared to single-step analysis. Therefore, we propose a two-step approach to test for an overall GxE effect for multiple phenotypes. Using simulations we demonstrate that, when more than one phenotype has a GxE effect (i.e., GxE pleiotropy), our approach offers a substantial gain in power (18% – 43%) to detect an aggregate-level GxE effect for a multivariate phenotype compared to an analogous two-step method to identify a GxE effect for a univariate phenotype. We applied the proposed approach to simultaneously analyze three lipids (LDL, HDL, and triglycerides) with the frequency of alcohol consumption as the environmental factor in the UK Biobank. The method identified two independent genome-wide significant signals of an overall GxE effect on the vector of lipids.


PLoS ONE ◽  
2019 ◽  
Vol 14 (8) ◽  
pp. e0220512 ◽  
Author(s):  
Zagaa Odgerel ◽  
Shilpa Sonti ◽  
Nora Hernandez ◽  
Jemin Park ◽  
Ruth Ottman ◽  
...  

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Elizabeth T. Cirulli ◽  
Simon White ◽  
Robert W. Read ◽  
Gai Elhanan ◽  
William J. Metcalf ◽  
...  

2020 ◽  
Vol 49 (3) ◽  
pp. 1022-1031 ◽  
Author(s):  
Katherine M Siewert ◽  
Derek Klarin ◽  
Scott M Damrauer ◽  
Kyong-Mi Chang ◽  
Philip S Tsao ◽  
...  

Abstract Background Nearly a fifth of the world’s population suffer from migraine headache, yet risk factors for this disease are poorly characterized. Methods To further elucidate these factors, we conducted a genetic correlation analysis using cross-trait linkage disequilibrium (LD) score regression between migraine headache and 47 traits from the UK Biobank. We then tested for possible causality between these phenotypes and migraine, using Mendelian randomization. In addition, we attempted replication of our findings in an independent genome-wide association study (GWAS) when available. Results We report multiple phenotypes with genetic correlation (P < 1.06 × 10⁻³) with migraine, including heart disease, type 2 diabetes, lipid levels, blood pressure, autoimmune and psychiatric phenotypes. In particular, we find evidence that blood pressure directly contributes to migraine and explains a previously suggested causal relationship between calcium and migraine. Conclusions This is the largest genetic correlation analysis of migraine headache to date, both in terms of migraine GWAS sample size and the number of phenotypes tested. We find that migraine has a shared genetic basis with a large number of traits, indicating pervasive pleiotropy at migraine-associated loci.
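The reported significance threshold is numerically consistent with a Bonferroni correction across the 47 traits tested. The abstract does not state how the threshold was chosen, so the family-wise error rate of 0.05 below is an assumption; the short check shows only that the arithmetic matches.

```python
alpha = 0.05      # assumed family-wise error rate (not stated in the abstract)
n_traits = 47     # number of UK Biobank traits, as given in the abstract

threshold = alpha / n_traits
print(f"{threshold:.2e}")   # 1.06e-03
```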


2016 ◽  
Vol 22 (10) ◽  
pp. 1509-1509 ◽  
Author(s):  
S E Legge ◽  
M L Hamshere ◽  
S Ripke ◽  
A F Pardinas ◽  
...  

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Dan Jiang ◽  
Cong Xin ◽  
Jinhua Ye ◽  
Yingbo Yuan ◽  
Ming Fang

Abstract Background Genomic prediction is an advanced method for estimating genetic values that has been widely adopted for genetic evaluation in animals and disease-risk prediction in humans. It estimates genetic values from genome-wide SNPs instead of pedigree. Its key step is constructing the genomic relationship matrix (GRM) from genome-wide SNPs; however, computing the GRM usually requires a large amount of memory, especially when the number of SNPs and the sample size are large, so it can become computationally prohibitive even for supercomputer clusters. We developed an integrative algorithm, ICGRM, to compute the GRM. To avoid calculating the GRM for the whole genome at once, ICGRM divides the genome-wide SNPs into several segments and computes the GRM-related summary statistics for each segment, which requires very little RAM; it then integrates these summary statistics to produce the GRM for the whole genome. Results In our simulation dataset, the memory used by ICGRM fell roughly 15-fold (from 218Gb to 14Gb) after the genome-wide SNPs were split into 5 to 200 parts by SNP count, making the computation feasible on almost all kinds of computer servers. ICGRM is implemented in C/C++ and freely available via https://github.com/mingfang618/CLGRM. Conclusions ICGRM is computationally efficient software for building the GRM and can be used for large datasets.
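The segment-then-integrate idea can be sketched in a few lines. This is an illustration only, not the ICGRM code: the abstract does not say which GRM definition or summary statistics are used, so the sketch assumes the common standardized form (the cross-product of per-SNP standardized genotypes divided by the number of SNPs), and all function names and the simulated data are hypothetical.

```python
import numpy as np

def standardize(genotypes):
    """Center and scale each SNP column of a 0/1/2 genotype matrix."""
    p = genotypes.mean(axis=0) / 2.0          # allele frequency per SNP
    return (genotypes - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

def grm_full(genotypes):
    """Reference: GRM from the whole standardized matrix at once."""
    Z = standardize(genotypes)
    return Z @ Z.T / genotypes.shape[1]

def grm_segmented(genotypes, n_segments):
    """Accumulate one partial cross-product per SNP segment, then normalize.

    Only one segment's standardized matrix exists at a time; the running
    n x n sum and the total SNP count are the per-segment summaries.
    """
    n, m = genotypes.shape
    acc = np.zeros((n, n))
    for cols in np.array_split(np.arange(m), n_segments):
        Z = standardize(genotypes[:, cols])
        acc += Z @ Z.T
    return acc / m

# Simulated genotypes: 50 individuals, 1,000 SNPs, polymorphic SNPs only.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(50, 1000)).astype(float)
freq = G.mean(axis=0) / 2.0
G = G[:, (freq > 0) & (freq < 1)]

K_full = grm_full(G)
K_seg = grm_segmented(G, n_segments=7)
```

Because standardization is applied per SNP, the segmented sum reproduces the full-matrix result up to floating-point rounding, while peak memory for the genotype data scales with the segment size rather than the genome size.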


2021 ◽  
Author(s):  
Sheila M. Gaynor ◽  
Kenneth E. Westerman ◽  
Lea L. Ackovic ◽  
Xihao Li ◽  
Zilin Li ◽  
...  

Abstract Summary We developed the STAAR WDL workflow to facilitate the analysis of rare variants in whole genome sequencing association studies. The open-access STAAR workflow written in the workflow description language (WDL) allows a user to perform rare variant testing for both gene-centric and genetic region approaches, enabling genome-wide, candidate, and conditional analyses. It incorporates functional annotations into the workflow as introduced in the STAAR method in order to boost the rare variant analysis power. This tool was specifically developed and optimized to be implemented on cloud-based platforms such as BioData Catalyst Powered by Terra. It provides easy-to-use functionality for rare variant analysis that can be incorporated into an exhaustive whole genome sequencing analysis pipeline. Availability and implementation The workflow is freely available from https://dockstore.org/workflows/github.com/sheilagaynor/STAAR_workflow.


2016 ◽  
Author(s):  
Douglas W. Bjelland ◽  
Uday Lingala ◽  
Piyush Patel ◽  
Matt Jones ◽  
Matthew C. Keller

Identical by descent (IBD) segments are used to understand a number of fundamental issues in genetics. IBD segments are typically detected using long stretches of identical alleles between haplotypes in whole-genome SNP data. Phase or SNP call errors in genomic data can degrade accuracy of IBD detection and lead to false positive calls, false negative calls, and under- or overextension of true IBD segments. Furthermore, the number of comparisons increases quadratically with sample size, requiring high computational efficiency. We developed a new IBD segment detection program, FISHR (Find IBD Shared Haplotypes Rapidly), in an attempt to accurately detect IBD segments and to better estimate their endpoints using an algorithm that is fast enough to be deployed on very large whole-genome SNP datasets. We compared the performance of FISHR to three leading IBD segment detection programs: GERMLINE, refinedIBD, and HaploScore. Using simulated and real genomic sequence data, we show that FISHR is slightly more accurate than all three programs at detecting long (greater than 3 cM) IBD segments but slightly less accurate than refinedIBD at detecting short (1 cM) IBD segments. Moreover, FISHR outperforms all three programs in determining the true endpoints of IBD segments, which is important for several reasons. FISHR takes two to four times longer than GERMLINE to run, whereas both GERMLINE and FISHR were orders of magnitude faster than refinedIBD and HaploScore. Overall, FISHR provides accurate IBD detection in unrelated individuals and is computationally efficient enough to be utilized on large SNP datasets of more than 20,000 individuals.
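The two general points in this abstract, detection from long identical stretches and quadratic growth in comparisons, can be made concrete with a toy example. This is not the FISHR algorithm, which the abstract does not describe: the function names are hypothetical, run length is measured in SNP sites rather than cM, and the sketch makes no allowance for phase or genotyping errors.

```python
from typing import List, Sequence, Tuple

def shared_runs(hap_a: Sequence[int], hap_b: Sequence[int],
                min_len: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) site ranges where the two haplotypes
    carry identical alleles for at least min_len consecutive sites."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if a == b:
            if start is None:
                start = i                     # a new matching run begins
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None                      # mismatch ends the run
    if start is not None and len(hap_a) - start >= min_len:
        runs.append((start, len(hap_a)))
    return runs

def n_pairs(n_individuals: int) -> int:
    """Number of unordered pairs of individuals to compare."""
    return n_individuals * (n_individuals - 1) // 2

hap_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
hap_b = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(shared_runs(hap_a, hap_b, min_len=4))   # [(0, 4), (5, 9)]
print(n_pairs(20_000))                        # 199990000
```

A single mismatch at site 4 splits what might be one true segment into two shorter runs, which is the kind of error-driven truncation the abstract describes; and at 20,000 individuals a naive all-pairs scan already involves roughly 200 million comparisons per haplotype pairing.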

