Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

Abstract

Background: Recent developments in technology generate big data. In genome-wide association studies (GWAS), tens of millions of SNPs may need to be tested for association with a trait of interest, which poses a great computational challenge. Fast algorithms are therefore needed in GWAS methodology. These algorithms must ensure high power in QTN detection, high accuracy in QTN effect estimation and a low false positive rate.

Results: Here, we accelerated the mrMLM algorithm by using the GEMMA idea, matrix transformations and identities. The target functions and derivatives, which are in vector/matrix form for each marker scan, are transformed into simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs with P-values ≤ 0.01 are evaluated in a multi-locus model by the LARS algorithm and/or EM-Empirical Bayes. We call the algorithm FASTmrMLM. Numerical simulation studies and real data analysis validated FASTmrMLM. FASTmrMLM reduces the running time of mrMLM by more than 50%. FASTmrMLM also shows high statistical power in QTN detection, high accuracy in QTN effect estimation and a low false positive rate compared to GEMMA, FarmCPU and mrMLM. Real data analysis shows that FASTmrMLM detected more previously reported genes than all the other methods: GEMMA/EMMA, FarmCPU and mrMLM.

Conclusions: FASTmrMLM is a fast and reliable algorithm for multi-locus GWAS and ensures high statistical power, high accuracy of estimates and a low false positive rate.

Author Summary: Current developments in technology generate a vast amount of data. In genome-wide association studies, tens of millions of markers may need to be tested for association with a trait of interest. Because of this computational challenge, we developed a fast algorithm for genome-wide association studies. Our approach is a two-stage method. In the first step, we used matrix transformations and identities to speed up the testing of each random marker effect. The target functions and derivatives, which are in vector/matrix form for each marker scan, are transformed into simple forms that are easy and efficient to evaluate during each optimization step. In the second step, we selected all potentially associated SNPs and evaluated them in a multi-locus model. In simulation studies, our algorithm significantly reduces the computing time. The new method also shows high statistical power in detecting significant markers, high accuracy in marker effect estimation and a low false positive rate. We also used the new method to identify relevant genes in real data analysis. We recommend our approach as a fast and reliable method for carrying out a multi-locus genome-wide association study.
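The screening step that links the two stages (keep every marker whose single-marker P-value is ≤ 0.01) can be illustrated with a minimal sketch. This is not the FASTmrMLM implementation: the per-marker test below is a plain correlation test with a large-sample normal approximation, standing in for the mixed-model test of a random marker effect, and the multi-locus stage (LARS and/or EM-Empirical Bayes) is omitted. The function names `pearson_r`, `marginal_p_value` and `screen_markers`, and the toy data, are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def marginal_p_value(genotype, phenotype):
    """Two-sided P-value for one marker (assumption: simple correlation test,
    normal approximation to the t statistic, no kinship or covariates)."""
    n = len(phenotype)
    r = pearson_r(genotype, phenotype)
    if r is None:
        return 1.0  # a monomorphic marker carries no association signal
    denom = 1.0 - r * r
    if denom <= 0.0:
        return 0.0  # perfect correlation
    t = r * math.sqrt((n - 2) / denom)
    return math.erfc(abs(t) / math.sqrt(2.0))


def screen_markers(genotypes, phenotype, threshold=0.01):
    """Return (marker index, P-value) for markers passing the screening threshold."""
    selected = []
    for index, genotype in enumerate(genotypes):
        p = marginal_p_value(genotype, phenotype)
        if p <= threshold:
            selected.append((index, p))
    return selected


# Toy data (hypothetical): marker 0 drives the trait, marker 1 does not.
marker_0 = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
marker_1 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
noise = [0.1, -0.1, 0.2, -0.2] * 3
trait = [2 * g + e for g, e in zip(marker_0, noise)]
candidates = screen_markers([marker_0, marker_1], trait)
```

In a real pipeline the markers in `candidates` would then be fitted jointly in a multi-locus model, which is the step where the source applies LARS and/or EM-Empirical Bayes.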


Across-cohort QC analyses of genome-wide association study summary statistics from complex traits

10.1101/033787 ◽

2015 ◽

Author(s):

Guo-Bo Chen ◽

Sang Hong Lee ◽

Matthew R Robinson ◽

Maciej Trzaskowski ◽

Zhi-Xiang Zhu ◽

...

Keyword(s):

Complex Traits ◽

Statistical Power ◽

Association Studies ◽

False Negative ◽

Genome Wide Association ◽

Effect Sizes ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Unknown Sample ◽

Genome Wide

Genome-wide association studies (GWASs) have been successful in discovering replicable SNP-trait associations for many quantitative traits and common diseases in humans. Typically the effect sizes of SNP alleles are very small, and this has led to large genome-wide association meta-analyses (GWAMA) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study we propose a new set of metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose a pair of methods for examining the concordance between demographic information and summary statistics. In method I, we use the population genetics Fst statistic to verify the genetic origin of each cohort and its geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. In method II, we conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. In addition, we propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. Finally, to quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data and does not breach individual privacy.
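Two of the quantities mentioned above can be computed from summary statistics alone, as the following minimal sketch suggests. The abstract does not give the authors' formulas, so both functions are assumptions: `fst_from_frequencies` uses the textbook heterozygosity-based Fst, (H_T - H_S) / H_T, for two equally weighted cohorts, and `mean_effect_difference_stat` is one plausible way to compare effect sizes and standard errors between two cohorts, not necessarily the statistic the paper defines. All names are hypothetical.

```python
def fst_from_frequencies(freqs_a, freqs_b):
    """Heterozygosity-based Fst between two cohorts from reported allele
    frequencies of biallelic SNPs (ratio of sums across SNPs).
    Assumption: the two cohorts are weighted equally."""
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        h_total = 2.0 * p_bar * (1.0 - p_bar)
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    if denominator == 0.0:
        return 0.0  # no polymorphic SNP, so no differentiation can be measured
    return numerator / denominator


def mean_effect_difference_stat(betas_a, ses_a, betas_b, ses_b):
    """Mean over SNPs of (b_a - b_b)^2 / (se_a^2 + se_b^2).
    Under the assumptions of this sketch (independent cohorts, the same true
    effect), each term is roughly chi-square with 1 df, so the mean is near 1.
    Values well below 1 would hint at sample overlap and values well above 1
    at heterogeneity."""
    terms = [
        (b_a - b_b) ** 2 / (s_a ** 2 + s_b ** 2)
        for b_a, s_a, b_b, s_b in zip(betas_a, ses_a, betas_b, ses_b)
    ]
    return sum(terms) / len(terms)


# Toy usage with made-up numbers (hypothetical cohorts).
fst = fst_from_frequencies([0.2, 0.5], [0.8, 0.5])
stat = mean_effect_difference_stat([0.10, 0.02], [0.1, 0.1], [-0.10, 0.02], [0.1, 0.1])
```

Both functions need only per-SNP summary statistics, which matches the setting described above, where individual-level genotype data are not shared.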


Issues with data transformation in genome-wide association studies for phenotypic variability

F1000Research ◽

10.12688/f1000research.2-200.v1 ◽

2013 ◽

Vol 2 ◽

pp. 200 ◽

Cited By ~ 6

Author(s):

Xia Shen ◽

Lars Rönnegård

Keyword(s):

Statistical Power ◽

Human Genetics ◽

Association Studies ◽

Phenotypic Variability ◽

Data Transformation ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

The Real ◽

Genome Wide

The purpose of this correspondence is to discuss and clarify a few points about data transformation used in genome-wide association studies, especially for phenotypic variability. By commenting on the recent publication by Sun et al. in the American Journal of Human Genetics, we emphasize the importance of statistical power in detecting functional loci and the real meaning of the scale of the phenotype in practice.


An atlas of genetic associations in UK Biobank

10.1101/176834 ◽

2017 ◽

Cited By ~ 18

Author(s):

Oriol Canela-Xandri ◽

Konrad Rawlik ◽

Albert Tenesa

Keyword(s):

Complex Traits ◽

Statistical Power ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Genetic Associations ◽

Genome Wide ◽

Related Individuals ◽

Sufficient Statistical Power

Abstract: Genome-wide association studies have revealed many loci contributing to the variation of complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive. Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biobank, is complicated by incidental structure present when collecting such large cohorts. For instance, UK Biobank comprises 107,162 third-degree or closer related participants. Traditionally, GWAS have removed related individuals because they comprised an insignificant proportion of the overall sample size; however, removing related individuals in UK Biobank would entail a substantial loss of power. Furthermore, modelling such structure using linear mixed models is computationally expensive and requires a computational infrastructure that may not be accessible to all researchers. Here we present an atlas of genetic associations for 118 non-binary and 599 binary traits of 408,455 related and unrelated UK Biobank participants of White-British descent. Results are compiled in a publicly accessible database that allows querying genome-wide association summary results for 623,944 genotyped and HapMap2 imputed SNPs, as well as downloading whole GWAS summary statistics for over 30 million imputed SNPs from the Haplotype Reference Consortium panel. Our atlas of associations (GeneATLAS, http://geneatlas.roslin.ed.ac.uk) will help researchers to query UK Biobank results in an easy way without the need to incur high computational costs.
