A fast mrMLM algorithm for multi-locus genome-wide association studies

Abstract Background Recent developments in technology result in the generation of big data. In genome-wide association studies (GWAS), we can get tens of millions of SNPs that need to be tested for association with a trait of interest. This poses a great computational challenge, so there is a need for fast algorithms in GWAS methodologies. These algorithms must ensure high power in QTN detection, high accuracy in QTN estimation and a low false positive rate. Results Here, we accelerated the mrMLM algorithm by using the GEMMA idea, matrix transformations and identities. The target functions and derivatives in vector/matrix forms for each marker scanning are transformed into simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs with P-values ≤ 0.01 are evaluated in a multi-locus model by the LARS algorithm and/or EM-Empirical Bayes. We call the algorithm FASTmrMLM. Numerical simulation studies and real data analysis validated FASTmrMLM. FASTmrMLM reduces the running time of mrMLM by more than 50%. FASTmrMLM also shows high statistical power in QTN detection, high accuracy in QTN estimation and a low false positive rate as compared to GEMMA, FarmCPU and mrMLM. Real data analysis shows that FASTmrMLM was able to detect more previously reported genes than all the other methods: GEMMA/EMMA, FarmCPU and mrMLM. Conclusions FASTmrMLM is a fast and reliable algorithm in multi-locus GWAS and ensures high statistical power, high accuracy of estimates and a low false positive rate.

Author Summary The current developments in technology result in the generation of a vast amount of data. In genome-wide association studies, we can get tens of millions of markers that need to be tested for association with a trait of interest. Due to the computational challenge faced, we developed a fast algorithm for genome-wide association studies. Our approach is a two-stage method. In the first step, we used matrix transformations and identities to quicken the testing of each random marker effect. The target functions and derivatives, which are in vector/matrix forms for each marker scanning, are transformed into simple forms that are easy and efficient to evaluate during each optimization step. In the second step, we selected all potentially associated SNPs and evaluated them in a multi-locus model. From simulation studies, our algorithm significantly reduces the computing time. The new method also shows high statistical power in detecting significant markers, high accuracy in marker effect estimation and a low false positive rate. We also used the new method to identify relevant genes in real data analysis. We recommend our approach as a fast and reliable method for carrying out a multi-locus genome-wide association study.
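The two-stage idea described above (scan every marker, then pass only markers with P-values ≤ 0.01 to a multi-locus model) can be illustrated with a minimal sketch. This is not the FASTmrMLM implementation: the per-marker test below is a plain Pearson correlation test with a normal approximation to the P-value, standing in for the mixed-model scan, and the second-stage multi-locus fit (LARS or EM-Empirical Bayes) is not shown. The function names, genotypes and phenotype values are hypothetical and exist only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between a marker (0/1/2 codes) and a phenotype."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant marker or phenotype carries no signal
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided P-value for r, using a normal approximation (an assumption)."""
    r = max(min(r, 0.999999), -0.999999)  # avoid division by zero at |r| = 1
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

def screen_markers(genotypes, phenotype, threshold=0.01):
    """Stage 1: keep (index, P-value) for markers with P-value <= threshold."""
    n = len(phenotype)
    kept = []
    for idx, marker in enumerate(genotypes):
        p = approx_p_value(pearson_r(marker, phenotype), n)
        if p <= threshold:
            kept.append((idx, p))
    return kept

# Hypothetical toy data: 3 markers, 12 individuals; only marker 0 drives the trait.
genotypes = [
    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0, 0, 1, 2, 1, 0, 2],
]
phenotype = [0.1, -0.2, 0.05, -0.1, 1.7, 1.45, 1.6, 1.35, 3.0, 3.15, 2.9, 3.05]

candidates = screen_markers(genotypes, phenotype)
print([idx for idx, _ in candidates])
```

In a real analysis the retained candidates would then be fitted jointly in a multi-locus model; the screening threshold of 0.01 is the one quoted in the abstract.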

PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects

Bioinformatics ◽

10.1093/bioinformatics/btz017 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3046-3054 ◽

Cited By ~ 2

Author(s):

Anastasia Gurinovich ◽

Harold Bae ◽

John J Farrell ◽

Stacy L Andersen ◽

Stefano Monti ◽

...

Keyword(s):

Genetic Variants ◽

Association Studies ◽

False Positive Rate ◽

Principal Component ◽

True Positive Rate ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Positive Rate

Abstract Motivation Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery. Results In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects’ ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype. Availability and implementation PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage. Supplementary information Supplementary data are available at Bioinformatics online.

Power Analysis of C-TDT for Small Sample Size Genome-Wide Association Studies by the Joint Use of Case-Parent Trios and Pairs

Computational and Mathematical Methods in Medicine ◽

10.1155/2013/235825 ◽

2013 ◽

Vol 2013 ◽

pp. 1-7 ◽

Cited By ~ 1

Author(s):

Farid Rajabli ◽

Gul Inan ◽

Ozlem Ilk

Keyword(s):

False Positive ◽

Association Studies ◽

False Positive Rate ◽

Small Sample ◽

Genome Wide Association ◽

Sample Sizes ◽

Test Statistic ◽

Genome Wide ◽

Positive Rate ◽

Family Based

In family-based genetic association studies, it is possible to encounter missing genotype information for one of the parents. This leads to a study consisting of both case-parent trios and case-parent pairs. One approach to this problem is the permutation-based combined transmission disequilibrium test statistic. However, it is still unknown how powerful this test statistic is with small sample sizes. In this paper, a simulation study is carried out to estimate the power and false positive rate of this test across different sample sizes for a family-based genome-wide association study. It is observed that a statistical power of over 80% and a reasonable false positive rate estimate can be achieved even with a combination of 50 trios and 30 pairs when 2% of the SNPs are assumed to be associated. Moreover, even smaller samples provide high power when smaller percentages of SNPs are associated with the disease.
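For orientation, the classic transmission disequilibrium test for complete trios can be sketched as follows. This is only the standard McNemar-type TDT statistic, not the permutation-based combined statistic (C-TDT) for trios plus pairs that the abstract evaluates. The function names and the transmission counts are hypothetical.

```python
import math

def tdt_statistic(transmitted, untransmitted):
    """Classic TDT: (b - c)^2 / (b + c) over heterozygous parents.

    transmitted   -- times the test allele was passed to an affected child
    untransmitted -- times the test allele was not passed on
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total

def chi2_1df_p_value(stat):
    """Upper-tail P-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: 30 transmissions versus 10 non-transmissions.
stat = tdt_statistic(30, 10)
print(stat, chi2_1df_p_value(stat))
```

Under the null hypothesis of no linkage or association, the statistic is compared with a chi-square distribution with one degree of freedom; a permutation scheme, as in the abstract, replaces that asymptotic reference distribution.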

Statistical power and utility of meta-analysis methods for cross-phenotype genome-wide association studies

PLoS ONE ◽

10.1371/journal.pone.0193256 ◽

2018 ◽

Vol 13 (3) ◽

pp. e0193256 ◽

Cited By ~ 13

Author(s):

Zhaozhong Zhu ◽

Verneri Anttila ◽

Jordan W. Smoller ◽

Phil H. Lee

Keyword(s):

Statistical Power ◽

Association Studies ◽

Meta Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Analysis Methods ◽

Genome Wide

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

PLoS Genetics ◽

10.1371/journal.pgen.1000582 ◽

2009 ◽

Vol 5 (7) ◽

pp. e1000582 ◽

Cited By ~ 14

Author(s):

Zheyang Wu ◽

Hongyu Zhao

Keyword(s):

Model Selection ◽

Statistical Power ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Selection Strategies ◽

Genome Wide

Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies

Scientific Reports ◽

10.1038/srep36671 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 20

Author(s):

Bettina Mieth ◽

Marius Kloft ◽

Juan Antonio Rodríguez ◽

Sören Sonnenburg ◽

Robin Vobruba ◽

...

Keyword(s):

Machine Learning ◽

Hypothesis Testing ◽

Statistical Power ◽

Association Studies ◽

Multiple Hypothesis Testing ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Multiple Hypothesis ◽

Genome Wide

Use of the Multivariate Discriminant Analysis for Genome-Wide Association Studies in Cattle

Animals ◽

10.3390/ani10081300 ◽

2020 ◽

Vol 10 (8) ◽

pp. 1300 ◽

Cited By ~ 1

Author(s):

Elisabetta Manca ◽

Alberto Cesarani ◽

Giustino Gaspa ◽

Silvia Sorbolini ◽

Nicolò P.P. Macciotta ◽

...

Keyword(s):

Discriminant Analysis ◽

Association Studies ◽

Real Data ◽

Genome Wide Association ◽

Stepwise Discriminant Analysis ◽

Genome Wide Association Studies ◽

Multivariate Method ◽

Genome Wide ◽

Single Marker ◽

Multivariate Gwas

Genome-wide association studies (GWAS) are traditionally carried out using the single marker regression model, which often leads to very few associations when only a small number of individuals is involved. Bayesian methods, such as BayesR, have obtained encouraging results when applied to GWAS. However, these approaches require that an a priori posterior inclusion probability threshold be fixed, thus arbitrarily affecting the obtained associations. To partially overcome these problems, a multivariate statistical algorithm was proposed. The basic idea was that animals with different phenotypic values of a specific trait share different allelic combinations for genes involved in its determinism. Three multivariate techniques were used to highlight the differences between the individuals assembled in high and low phenotype groups: the canonical discriminant analysis, the discriminant analysis and the stepwise discriminant analysis. The multivariate method was tested on both simulated and real data. The results from the simulation study highlighted that the multivariate GWAS detected a greater number of truly associated single nucleotide polymorphisms (SNPs) and quantitative trait loci (QTLs) than the single marker model and the Bayesian approach. For example, with 3000 animals, the traditional GWAS highlighted only 29 significantly associated markers and 13 QTLs, whereas the multivariate method found 127 associated SNPs and 65 QTLs. The gap between the two approaches slowly decreased as the number of animals increased. The Bayesian method gave worse results than the other two. On average, with the real data, the multivariate GWAS found 108 associated markers for each trait under study, and around 63% of these SNPs were also found by the single marker approach. Among the top 118 associated markers, 76 SNPs harbored putative candidate genes.

GWAPower: a statistical power calculation software for genome-wide association studies with quantitative traits

BMC Genetics ◽

10.1186/1471-2156-12-12 ◽

2011 ◽

Vol 12 (1) ◽

pp. 12 ◽

Cited By ~ 41

Author(s):

Sheng Feng ◽

Shengchu Wang ◽

Chia-Cheng Chen ◽

Lan Lan

Keyword(s):

Statistical Power ◽

Quantitative Traits ◽

Association Studies ◽

Genome Wide Association ◽

Power Calculation ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Statistical Power Calculation ◽

Calculation Software

Enrichment of statistical power for genome-wide association studies

BMC Biology ◽

10.1186/s12915-014-0073-5 ◽

2014 ◽

Vol 12 (1) ◽

Cited By ~ 42

Author(s):

Meng Li ◽

Xiaolei Liu ◽

Peter Bradbury ◽

Jianming Yu ◽

Yuan-Ming Zhang ◽

...

Keyword(s):

Statistical Power ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Mixture model-based association analysis with case-control data in genome wide association studies

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2016-0022 ◽

2017 ◽

Vol 16 (3) ◽

Author(s):

Fadhaa Ali ◽

Jian Zhang

Keyword(s):

Mixture Model ◽

Multiple Testing ◽

Hypothesis Test ◽

Association Studies ◽

Real Data ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Model Based ◽

Genome Wide ◽

The Individual

Abstract Multilocus haplotype analysis of candidate variants with genome wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet the challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease penetrances of haplotypes and thus may not be efficient. To fill this gap, we propose a mixture model-based approach for detecting risk haplotypes. Under the mixture model, haplotypes are clustered directly according to their estimated disease penetrances. A theoretical justification of the above model is provided. Furthermore, we introduce a hypothesis test for haplotype inheritance patterns which underpin this model. The performance of the proposed approach is evaluated by simulations and real data analysis. The results show that the proposed approach outperforms an existing multiple testing method.

Statistical testing and power analysis for brain-wide association study

10.1101/089870 ◽

2016 ◽

Cited By ~ 1

Author(s):

Weikang Gong ◽

Lin Wan ◽

Wenlian Lu ◽

Liang Ma ◽

Fan Cheng ◽

...

Keyword(s):

False Positive ◽

Power Analysis ◽

Statistical Power ◽

Spatial Information ◽

Association Studies ◽

False Positive Rate ◽

Gaussian Random Field ◽

Resting State Fmri ◽

Statistical Testing ◽

Positive Rate

Abstract The identification of connexel-wise associations, which involves examining functional connectivities between pairwise voxels across the whole brain, is both statistically and computationally challenging. Although such a connexel-wise methodology has recently been adopted by brain-wide association studies (BWAS) to identify connectivity changes in several mental disorders, such as schizophrenia, autism and depression [Cheng et al., 2015a,b, 2016], multiple correction and power analysis methods designed specifically for connexel-wise analysis are still lacking. Therefore, we herein report the development of a rigorous statistical framework for connexel-wise significance testing based on Gaussian random field theory. It includes controlling the family-wise error rate (FWER) of multiple hypothesis tests using topological inference methods, and calculating power and sample size for a connexel-wise study. Our theoretical framework can control the false-positive rate accurately, as validated empirically using two resting-state fMRI datasets. Compared with Bonferroni correction and false discovery rate (FDR), it can reduce the false-positive rate and increase statistical power by appropriately utilizing the spatial information of fMRI data. Importantly, our method considerably reduces the computational complexity of a permutation- or simulation-based approach; thus, it can efficiently tackle large datasets with ultra-high resolution images. The utility of our method is shown in a case-control study. Our approach can identify altered functional connectivities in a major depression disorder dataset, whereas existing methods failed. A software package is available at https://github.com/weikanggong/BWAS.
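The two baseline corrections that the abstract compares against, Bonferroni (FWER control) and the false discovery rate, can be sketched in a few lines. The FDR procedure shown is the Benjamini-Hochberg step-up rule, which is an assumption here since the abstract does not name the FDR method used. The Gaussian random field approach itself is not reproduced. The P-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m, which controls the FWER at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule, which controls the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted P-value satisfies p_(k) <= k/m * alpha.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject

# Hypothetical P-values from eight tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sum(bonferroni_reject(p_values)), sum(benjamini_hochberg_reject(p_values)))
```

On this toy input the FDR rule rejects more hypotheses than Bonferroni, which illustrates the usual trade-off: FDR control is less conservative than FWER control, at the cost of tolerating a controlled fraction of false discoveries.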
