Practical Issues in Screening and Variable Selection in Genome-Wide Association Analysis

Cancer Informatics ◽

10.4137/cin.s16350 ◽

2014 ◽

Vol 13s7 ◽

pp. CIN.S16350 ◽

Cited By ~ 2

Author(s):

Sungyeon Hong ◽

Yongkang Kim ◽

Taesung Park

Keyword(s):

Variable Selection ◽

Association Studies ◽

Screening Method ◽

Computational Cost ◽

Penalized Regression ◽

Adaptive Lasso ◽

Genome Wide Association ◽

SNP Analysis ◽

Genome Wide Association Studies ◽

Genome Wide

Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns for statistical inference from ultrahigh-dimensional data. In particular, genome-wide association studies (GWAS), which focus on identifying single nucleotide polymorphisms (SNPs) associated with a disease of interest, have produced ultrahigh-dimensional data. Numerous methods have been proposed to handle GWAS data. Most statistical methods have adopted a two-stage approach: pre-screening for dimension reduction and variable selection to identify causal SNPs. The pre-screening step selects SNPs in terms of their P-values or the absolute values of the regression coefficients in single-SNP analysis. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype using two real GWAS datasets.
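The two-stage approach described in this abstract can be illustrated with a minimal sketch. It assumes simulated genotypes coded 0/1/2, pre-screening by the absolute single-SNP regression coefficient, and a plain coordinate-descent lasso as the penalized regression. The function names, penalty value, and simulated effect sizes are hypothetical, and this is not the authors' pipeline.

```python
import random

def marginal_screen(X, y, keep):
    """Stage 1: rank SNPs by |slope| from single-SNP least squares."""
    n, p = len(X), len(X[0])
    y_bar = sum(y) / n
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        x_bar = sum(col) / n
        sxx = sum((v - x_bar) ** 2 for v in col)
        sxy = sum((v - x_bar) * (t - y_bar) for v, t in zip(col, y))
        scores.append(abs(sxy / sxx) if sxx > 0 else 0.0)
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the lasso proximal step)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Stage 2: coordinate-descent lasso minimizing
    (1/2n)*||y - Xb||^2 + lam*||b||_1 on centered data."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_bar = sum(y) / n
    resid = [t - y_bar for t in y]  # residual when all coefficients are 0
    beta = [0.0] * p
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Correlation of SNP j with the partial residual (SNP j excluded).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    return beta

# Simulated data (hypothetical): 200 individuals, 50 SNPs, two causal SNPs.
rng = random.Random(0)
n, p, maf = 200, 50, 0.3
X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)] for _ in range(n)]
causal = {3: 1.5, 17: -1.0}
y = [sum(b * row[j] for j, b in causal.items()) + rng.gauss(0.0, 0.5) for row in X]

kept = marginal_screen(X, y, keep=10)        # pre-screening
X_kept = [[row[j] for j in kept] for row in X]
beta = lasso_cd(X_kept, y, lam=0.1)          # variable selection
selected = [kept[k] for k, b in enumerate(beta) if b != 0.0]
print("kept after screening:", kept)
print("selected by lasso:", selected)
```

Screening by P-value instead of coefficient magnitude, or replacing the lasso with ridge, adaptive lasso, or elastic net, would change only the corresponding stage, which is the kind of combination the abstract says the paper compares.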

A Bayesian Regression Model with Variable Selection for Genome-Wide Association Studies

Case Studies in Bayesian Statistical Modelling and Analysis - Wiley Series in Probability and Statistics ◽

10.1002/9781118394472.ch6 ◽

2012 ◽

pp. 103-117

Author(s):

Carla Chen ◽

Kerrie L. Mengersen ◽

Katja Ickstadt ◽

Jonathan M. Keith

Keyword(s):

Variable Selection ◽

Regression Model ◽

Association Studies ◽

Bayesian Regression ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Selection For

Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies

Methods ◽

10.1016/j.ymeth.2018.04.021 ◽

2018 ◽

Vol 145 ◽

pp. 2-9 ◽

Cited By ~ 1

Author(s):

Haohan Wang ◽

Bryon Aragam ◽

Eric P. Xing

Keyword(s):

Variable Selection ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Heterogeneous Datasets

Variable selection in statistical models using population-based incremental learning with applications to genome-wide association studies

2012 IEEE Congress on Evolutionary Computation ◽

10.1109/cec.2012.6256577 ◽

2012 ◽

Author(s):

Hien Duy Nguyen ◽

Ian A. Wood

Keyword(s):

Variable Selection ◽

Statistical Models ◽

Incremental Learning ◽

Association Studies ◽

Population Based ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies

2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2017.8217687 ◽

2017 ◽

Cited By ~ 9

Author(s):

Haohan Wang ◽

Bryon Aragam ◽

Eric P. Xing

Keyword(s):

Variable Selection ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Heterogeneous Datasets

FINEMAP: efficient variable selection using summary data from genome-wide association studies

Bioinformatics ◽

10.1093/bioinformatics/btw018 ◽

2016 ◽

Vol 32 (10) ◽

pp. 1493-1501 ◽

Cited By ~ 215

Author(s):

Christian Benner ◽

Chris C.A. Spencer ◽

Aki S. Havulinna ◽

Veikko Salomaa ◽

Samuli Ripatti ◽

...

Keyword(s):

Variable Selection ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Summary Data

Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies

10.1101/228106 ◽

2017 ◽

Cited By ~ 2

Author(s):

Haohan Wang ◽

Bryon Aragam ◽

Eric P. Xing

Keyword(s):

Population Structure ◽

Variable Selection ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Low Rank ◽

Genome Wide Association Studies ◽

Unified Framework ◽

Genome Wide

A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our method.

Pleiotropic Mapping and Annotation Selection in Genome-wide Association Studies with Penalized Gaussian Mixture Models

10.1101/256461 ◽

2018 ◽

Author(s):

Ping Zeng ◽

Xinjie Hao ◽

Xiang Zhou

Keyword(s):

Association Mapping ◽

Complex Traits ◽

Association Studies ◽

Penalized Regression ◽

Genome Wide Association ◽

Accurate Estimation ◽

Genome Wide Association Studies ◽

Multiple Traits ◽

SNP Association ◽

Genome Wide

Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci are associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits together with genome-wide SNPs. Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association patterns using mixture modeling, and has the potential to reveal causal relationships between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially noninformative set. To scale iMAP inference to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP both in terms of high association mapping power and in terms of accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. iMAP is freely available at www.xzlab.org/software.html.

Biological and practical implications of genome-wide association study of schizophrenia using Bayesian variable selection

npj Schizophrenia ◽

10.1038/s41537-019-0088-6 ◽

2019 ◽

Vol 5 (1) ◽

Author(s):

Benazir Rowe ◽

Xiangning Chen ◽

Zuoheng Wang ◽

Jingchun Chen ◽

Amei Amei

Keyword(s):

Variable Selection ◽

Sample Size ◽

Permutation Test ◽

Association Studies ◽

Statistical Significance ◽

Univariate Analysis ◽

Bayesian Variable Selection ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Genome-wide association studies (GWAS) have identified over 100 loci associated with schizophrenia. Most of these studies test genetic variants for association one at a time. In this study, we performed a GWAS of the Molecular Genetics of Schizophrenia (MGS) dataset with 5334 subjects using the multivariate Bayesian variable selection (BVS) method Posterior Inference via Model Averaging and Subset Selection (piMASS) and compared our results with the previous univariate analysis of the MGS dataset. We showed that piMASS can improve the power of detecting schizophrenia-associated SNPs, potentially leading to new discoveries from existing data without increasing the sample size. We tested SNPs in groups to allow for local additive effects and used a permutation test to determine statistical significance in order to compare our results with the univariate method. The previous univariate analysis of the MGS dataset revealed no genome-wide significant loci. Using the same dataset, we identified a single region that exceeded genome-wide significance. The result was replicated using an independent Swedish Schizophrenia Case–Control Study (SSCCS) dataset. Based on the SZGR 2.0 database, we found 63 SNPs from the best-performing regions that are mapped to 27 genes known to be associated with schizophrenia. Overall, we demonstrated that piMASS could discover association signals that otherwise would need a much larger sample size. An important implication of our study is that reanalyzing published datasets with BVS methods like piMASS might provide more power to discover new risk variants for many diseases without new sample collection, ascertainment, and genotyping.

FORGE: multivariate calculation of gene-wide p-values from Genome-Wide Association Studies

10.1101/023648 ◽

2015 ◽

Cited By ~ 2

Author(s):

Inti Inal Pedroso ◽

Michael R Barnes ◽

Anbarasu Lourdusamy ◽

Ammar Al-Chalabi ◽

Gerome Breen

Keyword(s):

Statistical Power ◽

Association Studies ◽

Single Point ◽

Genome Wide Association ◽

P Value ◽

Disease Genes ◽

SNP Analysis ◽

Genome Wide Association Studies ◽

P Values ◽

Genome Wide

Genome-wide association studies (GWAS) have proven a valuable tool to explore the genetic basis of many traits. However, many GWAS lack statistical power, and the commonly used single-point analysis method needs to be complemented to enhance power and interpretation. Multivariate region-wide or gene-wide association analyses are an alternative, allowing for identification of disease genes in a manner more robust to allelic heterogeneity. Gene-based association also facilitates systems biology analyses by generating a single p-value per gene. We have designed and implemented FORGE, a software suite which implements a range of methods for the combination of p-values for the individual genetic variants within a gene or genomic region. The software can be used with summary statistics (marker ids and p-values) and accepts as input the result file formats of commonly used genetic association software. When applied to a study of Crohn's disease (CD) susceptibility, it identified all genes found by single-SNP analysis and additional genes identified by a large independent meta-analysis. FORGE p-values on gene-set analyses highlighted association with the Jak-STAT and cytokine signalling pathways, both previously associated with CD. We highlight the software's main features, its future development directions and provide a comparison with alternative available software tools. FORGE can be freely accessed at https://github.com/inti/FORGE.
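Combining per-variant p-values into a single gene-level p-value can be illustrated with Fisher's method, one classical combination rule. This is a sketch only: the abstract does not say which combination methods FORGE implements, Fisher's method assumes independent tests (which linkage disequilibrium between nearby SNPs violates), and the gene labels and p-values below are made up.

```python
import math

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) follows a chi-square distribution
    with 2k degrees of freedom when the k tests are independent."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > stat) = exp(-stat/2) * sum_{i=0}^{k-1} (stat/2)^i / i!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total

# Hypothetical per-SNP p-values grouped by gene.
snp_p = {
    "GENE_A": [0.01, 0.02, 0.03],
    "GENE_B": [0.5, 0.5],
}
gene_p = {gene: fisher_combine(ps)[1] for gene, ps in snp_p.items()}
for gene, p in gene_p.items():
    print(gene, p)
```

Several moderately small p-values in one gene combine into a gene-level p-value smaller than any of them, which is why gene-based tests can gain power under allelic heterogeneity. Correlated SNPs would need a method that accounts for their dependence.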

Abstract 1444: Weighted Gene Co-expression Network Analysis of Adipose and Liver Reveals Gene Modules Related to Plasma HDL Levels and Containing Candidate Genes at Loci Identified in Genome Wide Association Studies

Circulation ◽

10.1161/circ.118.suppl_18.s_327 ◽

2008 ◽

Vol 118 (suppl_18) ◽

Author(s):

Peter Langfelder ◽

Margarete Mehrabian ◽

Eric E Schadt ◽

Aldons J Lusis ◽

Steve Horvath

Keyword(s):

Network Analysis ◽

Association Studies ◽

Expression Patterns ◽

Screening Method ◽

Functional Enrichment ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Gene Modules ◽

HDL Metabolism

The genetic and environmental factors contributing to HDL-cholesterol levels are highly complex. For example, a recent meta-analysis of three genome-wide association studies (GWAS), consisting of over 9000 individuals, revealed several loci, but altogether these explained less than 10% of HDL variation. Since HDL has a heritability of about 50%, there clearly must be many as yet unidentified factors. To better address this complexity, we have utilized integrative genomic approaches to relate common DNA variation to gene networks and HDL metabolism. We report a Weighted Gene Co-expression Network Analysis (WGCNA) of genome-wide expression data from a CAST X C57BL6/J F2 intercross. WGCNA is a systems-based gene expression analysis and gene screening method. It utilizes co-expression patterns among genes to identify gene modules (groups of highly co-expressed genes) significantly associated with a clinical trait, in this case plasma HDL levels. Co-expression modules may represent cellular processes and interacting pathways that provide a bridge between individual genes and a systems-level view of the organism. A module-centric analysis effectively alleviates the multiple testing problems inherent in microarray data analysis and can be considered a biologically motivated data-reduction scheme. Using data from liver and adipose tissues, we have identified several modules strongly associated with plasma HDL levels (p-values ranging from below 1e-20 to 1e-5). Gene ontology and functional enrichment analysis indicate that these modules are indeed biologically meaningful. The modules contain variants of several genes under loci that were recently implicated by three GWA studies: liver modules include GCKR, ANGPTL4, ABCA3, APOA1, and APOA4, while the adipose modules include ABCA6, ANGPTL11 and 12, MMAB, MLXIPL, SORT1, PBX4, PLTP, and APOL6. Thus, our study also serves to help identify likely candidates from GWAS. In conclusion, applying WGCNA methods reveals modules that are biologically meaningful, statistically significant, and enriched for genes and pathways related to HDL metabolism and transport.
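The co-expression idea in this abstract can be sketched as follows. The sketch computes the soft-thresholded unsigned adjacency |cor(x_i, x_j)|^β used in weighted co-expression networks, plus each gene's connectivity. Module detection, typically done by hierarchical clustering of a topological-overlap dissimilarity, is omitted. The expression values, gene labels, and the choice β = 6 are illustrative assumptions, not data or settings from this study.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def soft_adjacency(expr, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta."""
    genes = list(expr)
    adj = {g: {} for g in genes}
    for g in genes:
        for h in genes:
            adj[g][h] = 1.0 if g == h else abs(pearson(expr[g], expr[h])) ** beta
    return adj

# Hypothetical expression profiles across five samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 2.3, 2.9, 4.2, 4.8],   # tracks g1 closely
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to g1 and g2
}
adj = soft_adjacency(expr)

# Connectivity: total connection strength of each gene to all other genes.
connectivity = {g: sum(w for h, w in adj[g].items() if h != g) for g in adj}
for g, k in connectivity.items():
    print(g, round(k, 4))
```

Raising correlations to a power keeps strong co-expression links near 1 while pushing weak ones toward 0, so tightly co-expressed genes such as g1 and g2 here would tend to fall into the same module.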