Computationally efficient whole-genome regression for quantitative and binary traits

2021
Author(s): Joelle Mbatchou, Leland Barnard, Joshua Backman, Anthony Marcketta, Jack A. Kosmicki, ...

Abstract
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine learning method called REGENIE for fitting a whole-genome regression model that is orders of magnitude faster than alternatives while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. The method is applicable to both quantitative and binary phenotypes, including rare-variant analysis of binary traits with unbalanced case-control ratios, for which we introduce a fast, approximate Firth logistic regression test. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach compared to several existing methods using quantitative and binary traits from the UK Biobank dataset with up to 407,746 individuals.
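
The core computational idea, stacked ridge regressions over local genotype blocks, can be sketched in a few lines. The block sizes, penalty values, and random data below are illustrative assumptions rather than REGENIE's actual defaults, and the real method adds cross-validated penalties and leave-one-chromosome-out predictions:

```python
# Minimal sketch of two-level (stacked) ridge regression over local
# genotype blocks: fit one ridge predictor per block, holding only one
# block in memory at a time, then combine the block-level predictions
# with a second ridge regression. All shapes and penalties are made up.
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution beta = (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def level0_predictions(blocks, y, lam):
    """One predictor per block; only one block resident in memory at a time."""
    preds = []
    for X_block in blocks:          # each block: (n_samples, n_snps_in_block)
        beta = ridge_fit(X_block, y, lam)
        preds.append(X_block @ beta)
    return np.column_stack(preds)   # (n_samples, n_blocks)

def level1_combine(W, y, lam):
    """Second-level ridge over the stacked block predictions."""
    alpha = ridge_fit(W, y, lam)
    return W @ alpha                # whole-genome polygenic prediction

# Toy example with random data standing in for genotype blocks.
rng = np.random.default_rng(0)
n = 500
blocks = [rng.integers(0, 3, size=(n, 100)).astype(float) for _ in range(8)]
y = rng.normal(size=n)
W = level0_predictions(blocks, y, lam=10.0)
y_hat = level1_combine(W, y, lam=1.0)
```

Because each block is processed independently, only one (n_samples × block_size) matrix is ever resident in memory, which is what makes the approach scale to hundreds of thousands of samples.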


2017
Author(s): Andrew Whalen, Roger Ros-Freixedes, David L Wilson, Gregor Gorjanc, John M Hickey

Abstract
In this paper we extend multi-locus iterative peeling into a computationally efficient method for calling, phasing, and imputing sequence data of any coverage in small or large pedigrees. Our method, called hybrid peeling, uses multi-locus iterative peeling to estimate shared chromosome segments between parents and their offspring, and then uses single-locus iterative peeling to aggregate genomic information across multiple generations. Using a synthetic dataset, we first analysed the performance of hybrid peeling for calling and phasing alleles in disconnected families, i.e., families containing only a focal individual, its parents, and its grandparents. Second, we analysed the performance of hybrid peeling for calling and phasing alleles in the context of the full pedigree. Third, we analysed the performance of hybrid peeling for imputing whole-genome sequence data to the remaining individuals in the population. We found that hybrid peeling substantially increased the number of genotypes that were called and phased by leveraging sequence information on related individuals. The calling rate and accuracy increased when the full pedigree was used compared to a reduced pedigree of just parents and grandparents. Finally, hybrid peeling accurately imputed whole-genome sequence information to non-sequenced individuals. We believe that this algorithm will enable the generation of low-cost, high-accuracy whole-genome sequence data in many pedigreed populations. We are making this algorithm available as a standalone program called AlphaPeel.
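
As a rough illustration of the single-locus peeling step, the sketch below combines parental genotype probabilities with a Mendelian transmission table and an offspring's own read likelihoods. The transmission table is exact for a biallelic locus, but the probability values are invented, and the multi-locus segment estimation that hybrid peeling adds on top is omitted:

```python
# Single-locus peeling sketch: an offspring's genotype posterior combines
# Mendelian transmission from its parents ("anterior" information) with
# the offspring's own sequencing likelihood. Illustrative values only.
import numpy as np

# T[s, d, o] = P(offspring genotype o | sire genotype s, dam genotype d)
# for biallelic genotypes coded 0=aa, 1=Aa, 2=AA.
T = np.zeros((3, 3, 3))
gamete = np.array([[1.0, 0.0],    # aa transmits a with certainty
                   [0.5, 0.5],    # Aa transmits a or A equally
                   [0.0, 1.0]])   # AA transmits A with certainty
for s in range(3):
    for d in range(3):
        for gs in range(2):
            for gd in range(2):
                T[s, d, gs + gd] += gamete[s, gs] * gamete[d, gd]

def peel_down(p_sire, p_dam, offspring_lik):
    """Posterior over the offspring genotype given parental genotype
    probabilities and the offspring's own read likelihoods."""
    anterior = np.einsum('s,d,sdo->o', p_sire, p_dam, T)
    post = anterior * offspring_lik
    return post / post.sum()

# Example: confident Aa sire, uncertain dam, offspring reads weakly
# favouring the alternate allele.
p_sire = np.array([0.0, 1.0, 0.0])
p_dam = np.array([0.3, 0.5, 0.2])
lik = np.array([0.1, 0.4, 0.5])
print(peel_down(p_sire, p_dam, lik))
```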


2019
Vol 104 (2), pp. 260-274
Author(s): Han Chen, Jennifer E. Huffman, Jennifer A. Brody, Chaolong Wang, Seunggeun Lee, ...

2020
Vol 21 (1)
Author(s): Sierra Gillis, Andrew Roth

Abstract
Background: At diagnosis, tumours are typically composed of a mixture of genomically distinct malignant cell populations. Bulk sequencing of tumour samples coupled with computational deconvolution can be used to identify these populations and study cancer evolution. Existing computational methods for population deconvolution are slow and/or potentially inaccurate when applied to the large datasets generated by whole-genome sequencing.
Results: We describe PyClone-VI, a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers. We demonstrate the utility of the method by analyzing data from 1717 patients from the PCAWG study and 100 patients from the TRACERx study.
Conclusions: Our proposed method is 10–100× faster than existing methods while providing results that are as accurate. Software implementing our method is freely available at https://github.com/Roth-Lab/pyclone-vi.
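
To make the deconvolution idea concrete, here is a toy binomial-mixture EM that clusters mutations by cellular prevalence. It is not PyClone-VI's variational algorithm, and it assumes a diploid genome, no copy-number change, and 100% tumour purity:

```python
# Toy clonal deconvolution: variant read counts are modelled as binomial
# draws whose success probability depends on the cluster's cellular
# prevalence; EM recovers the cluster prevalences and assignments.
import numpy as np

def em_binomial_mixture(alt, depth, k=3, iters=200, seed=1):
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.05, 0.95, size=k)   # cluster prevalences
    pi = np.full(k, 1.0 / k)                # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each cluster for each mutation.
        vaf = phi / 2.0                      # diploid, fully clonal assumption
        loglik = (alt[:, None] * np.log(vaf) +
                  (depth - alt)[:, None] * np.log(1 - vaf) +
                  np.log(pi))
        r = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update prevalences and mixing weights.
        phi = np.clip((r * alt[:, None]).sum(0) /
                      (r * depth[:, None]).sum(0) * 2.0, 1e-3, 1 - 1e-3)
        pi = r.mean(axis=0)
    return phi, pi, r.argmax(axis=1)

# Simulated data: two clones at prevalence 0.8 and 0.3.
rng = np.random.default_rng(0)
truth = rng.choice([0.8, 0.3], size=200)
depth = rng.integers(50, 150, size=200)
alt = rng.binomial(depth, truth / 2.0)
phi, pi, z = em_binomial_mixture(alt, depth)
print(np.round(np.sort(phi), 2))   # should roughly recover 0.3 and 0.8
```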


2020
Author(s): Francisco J. Pérez-Reche, Ovidiu Rotariu, Bruno S. Lopes, Ken J. Forbes, Norval J.C. Strachan

Abstract
Whole-genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that effectively mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method which rapidly deals with these large datasets, as well as methods for optimally selecting loci. We applied it to WGS data to determine the source of human campylobacteriosis and the geographical origin of diverse biological species, including humans, and to proteomic data to classify breast cancer tumours. The MMD method provides highly accurate attribution and is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data, and have wide application.
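
A hedged sketch of the attribution step: score a query genotype against each candidate source population by its minimal per-isolate mismatch distance and attribute the query to the closest source. The Hamming distance over binary allele calls and the simulated populations below are illustrative stand-ins for the paper's distance and data:

```python
# Minimal-distance source attribution sketch: the query is assigned to
# the population containing the isolate with the smallest per-locus
# mismatch fraction. Simulated populations stand in for real WGS data.
import numpy as np

def attribute(query, sources):
    """query: (n_loci,) allele calls; sources: dict name -> (n_isolates, n_loci)."""
    best, best_d = None, np.inf
    for name, pop in sources.items():
        # Minimal per-isolate mismatch distance to this source population.
        d = (pop != query).mean(axis=1).min()
        if d < best_d:
            best, best_d = name, d
    return best, best_d

rng = np.random.default_rng(0)
cattle = rng.integers(0, 2, size=(50, 1000))
poultry = rng.integers(0, 2, size=(50, 1000))
query = poultry[0].copy()
query[:30] = 1 - query[:30]          # perturb a few loci
print(attribute(query, {"cattle": cattle, "poultry": poultry}))
```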


2017
Vol 114 (10), pp. E1923-E1932
Author(s): H. Richard Johnston, Pankaj Chopra, Thomas S. Wingo, Viren Patel, Michael P. Epstein, ...

The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of the datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a statistical framework that allows for a base-by-base error model, enabling this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.
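
As a rough illustration of what a base-by-base error model buys, the sketch below calls a site variant when its non-reference read count is improbable under a site-specific error rate. The rates and significance threshold are invented for the example and are not PECaller's actual model:

```python
# Per-site error model sketch: each site has its own error rate, and a
# site is called variant when the observed non-reference read count is
# improbable under an errors-only binomial model.
from scipy.stats import binom

def call_site(depth, alt_reads, site_error_rate, alpha=1e-6):
    """Return True if alt_reads is unlikely under the error-only model."""
    # P(X >= alt_reads | errors only): binomial survival function.
    p = binom.sf(alt_reads - 1, depth, site_error_rate)
    return p < alpha

# A noisy site tolerates more alt reads than a clean one.
print(call_site(depth=60, alt_reads=5, site_error_rate=0.001))  # True: likely variant
print(call_site(depth=60, alt_reads=5, site_error_rate=0.05))   # False: plausible noise
```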


2019
Vol 20 (1)
Author(s): Dan Jiang, Cong Xin, Jinhua Ye, Yingbo Yuan, Ming Fang

Abstract
Background: Genomic prediction is an advanced method for estimating genetic values, which has been widely adopted for genetic evaluation in animals and disease-risk prediction in humans. It estimates genetic values from genome-wide distributed SNPs instead of pedigree. Its key step is constructing the genomic relationship matrix (GRM) from genome-wide SNPs; however, calculating the GRM usually requires huge amounts of computer memory, especially when the number of SNPs and the sample size are large, so that it can become computationally prohibitive even for supercomputer clusters. We herein developed an integrative algorithm, ICGRM, to compute the GRM. To avoid calculating the GRM over the whole genome at once, ICGRM freely divides the genome-wide SNPs into several segments and computes the summary statistics related to the GRM for each segment, which requires very little computer RAM; it then integrates these summary statistics to produce the GRM for the whole genome.
Results: The computer memory used by ICGRM was reduced 15-fold (from 218 GB to 14 GB) after the genome SNPs were split into 5 to 200 parts (in terms of the number of SNPs) in our simulation dataset, making it computationally feasible for almost all kinds of computer servers. ICGRM is implemented in C/C++ and freely available via https://github.com/mingfang618/CLGRM.
Conclusions: ICGRM is computationally efficient software for building the GRM and can be used on big datasets.
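
The integration trick is easy to sketch: because the GRM is a sum of per-SNP outer products, per-segment cross-products can be accumulated and rescaled at the end, so only one segment's genotypes are in memory at a time. The VanRaden-style standardization below is a common choice but an assumption about ICGRM's internals:

```python
# Segment-wise GRM construction sketch: stream genotype segments and
# accumulate Z @ Z.T per segment, so memory scales with the segment
# size rather than the whole genome.
import numpy as np

def grm_by_segments(segment_iter, n_samples):
    G = np.zeros((n_samples, n_samples))
    m_total = 0
    for X in segment_iter:               # X: (n_samples, n_snps_in_segment)
        p = X.mean(axis=0) / 2.0         # allele frequencies
        keep = (p > 0) & (p < 1)         # drop monomorphic SNPs
        Z = (X[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
        G += Z @ Z.T                     # per-segment summary statistic
        m_total += keep.sum()
    return G / m_total                   # VanRaden-style GRM

# Toy check: segmenting the SNPs does not change the resulting GRM.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 1000)).astype(float)
whole = grm_by_segments([X], 100)
split = grm_by_segments([X[:, :400], X[:, 400:]], 100)
print(np.allclose(whole, split))         # True
```

The final check illustrates the point of the Results section: splitting the SNPs into segments changes memory use, not the resulting GRM.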


2020
Author(s): Sierra Gillis, Andrew Roth

Abstract
We describe PyClone-VI, a computationally efficient Bayesian statistical method for inferring the clonal population structure of cancers. Our proposed method is 10–100× faster than existing methods while providing results that are as accurate. We demonstrate the utility of the method by analyzing data from 1717 patients from the PCAWG study and 100 patients from the TRACERx study. Software implementing our method is freely available at https://github.com/Roth-Lab/pyclone-vi.


2016
Author(s): H Richard Johnston, Pankaj Chopra, Thomas S Wingo, Viren Patel, Michael P Epstein, ...

Abstract
The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of the datasets places an enormous burden on computational, disk array, and network resources. Here we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a novel statistical framework that allows for a base-by-base error model, enabling this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.


2018
Author(s): Han Chen, Jennifer E. Huffman, Jennifer A. Brody, Chaolong Wang, Seunggeun Lee, ...

Abstract
With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate-unit tests. The burden test and the Sequence Kernel Association Test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-Set Mixed Model Association Tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMAT tests share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMAT tests correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests with a real-data analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
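
The "fit the null model once" economy is easy to see in a simplified linear-model version of a burden score test; the GLMM, relatedness, and SKAT-type components of SMMAT are omitted here, and all data below are simulated:

```python
# Score-test sketch: the covariates-only null model is fit once, and each
# variant set then needs only a score statistic computed from the same
# residuals, which is what makes genome-wide variant-set testing cheap.
import numpy as np
from scipy.stats import chi2

def fit_null(X, y):
    """Covariates-only least squares; returns residuals and error variance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return resid, sigma2

def burden_score_test(G, w, resid, sigma2, X):
    """Score test for a weighted burden of the variants in G."""
    b = G @ w                              # burden genotype
    # Project covariates out of b so the null variance is correct.
    bx, *_ = np.linalg.lstsq(X, b, rcond=None)
    b_res = b - X @ bx
    U = b_res @ resid                      # score
    V = sigma2 * (b_res @ b_res)           # its variance under the null
    return chi2.sf(U * U / V, df=1)        # p-value

rng = np.random.default_rng(0)
n, q, m = 2000, 3, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
y = X @ rng.normal(size=q + 1) + rng.normal(size=n)
resid, sigma2 = fit_null(X, y)             # fit once...
for _ in range(3):                          # ...reuse for every variant set
    G = rng.binomial(2, 0.01, size=(n, m)).astype(float)
    print(burden_score_test(G, np.ones(m), resid, sigma2, X))
```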

