Practical Issues in Screening and Variable Selection in Genome-Wide Association Analysis

2014 ◽  
Vol 13s7 ◽  
pp. CIN.S16350 ◽  
Author(s):  
Sungyeon Hong ◽  
Yongkang Kim ◽  
Taesung Park

Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns for statistical inference from ultrahigh-dimensional data. In particular, genome-wide association studies (GWAS), which focus on identifying single nucleotide polymorphisms (SNPs) associated with a disease of interest, have produced ultrahigh-dimensional data. Numerous methods have been proposed to handle GWAS data. Most statistical methods adopt a two-stage approach: pre-screening for dimension reduction, followed by variable selection to identify causal SNPs. The pre-screening step ranks SNPs by their P-values or by the absolute values of their regression coefficients in single-SNP analysis. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype using two real GWAS datasets.
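The two-stage approach described above can be sketched in a few lines. This is a minimal illustration on simulated genotypes, not the authors' pipeline: the function names (`marginal_screen`, `lasso_cd`), the simulated data, the number of retained SNPs, and the penalty value are all assumptions made for the example. Screening ranks SNPs by the absolute single-SNP regression coefficient, and the selection step is a lasso fitted by plain coordinate descent. The code assumes no SNP is monomorphic in the sample.

```python
import numpy as np

def marginal_screen(X, y, keep):
    """Rank SNPs by |single-SNP regression slope|; return indices of the top `keep`."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    slopes = Xc.T @ yc / (Xc ** 2).sum(axis=0)  # one simple regression per SNP
    return np.argsort(-np.abs(slopes))[:keep]

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by coordinate descent on centered data.

    Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    col_ss = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of SNP j with the partial residual (SNP j's own fit added back)
            rho = Xc[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid += Xc[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Simulated data (assumption): 200 subjects, 1000 SNPs coded 0/1/2, two causal SNPs.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = 1.0 * X[:, 10] - 1.0 * X[:, 500] + rng.normal(scale=0.5, size=n)

kept = marginal_screen(X, y, keep=50)          # stage 1: pre-screening
beta = lasso_cd(X[:, kept], y, lam=0.1)        # stage 2: penalized regression
selected = sorted(kept[beta != 0].tolist())    # SNPs with nonzero lasso coefficients
```

With effects this strong, the two simulated causal SNPs (indices 10 and 500, chosen arbitrarily) should survive screening and receive nonzero coefficients. In a real analysis the penalty would be tuned, for example by cross-validation, rather than fixed.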

2016 ◽  
Vol 32 (10) ◽  
pp. 1493-1501 ◽  
Author(s):  
Christian Benner ◽  
Chris C.A. Spencer ◽  
Aki S. Havulinna ◽  
Veikko Salomaa ◽  
Samuli Ripatti ◽  
...  

2017 ◽  
Author(s):  
Haohan Wang ◽  
Bryon Aragam ◽  
Eric P. Xing

A fundamental and important challenge in modern datasets of ever-increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our method.


2018 ◽  
Author(s):  
Ping Zeng ◽  
Xinjie Hao ◽  
Xiang Zhou

Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci are associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits with genome-wide SNPs. Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association patterns using mixture modeling, and has the potential to reveal causal relationships between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially noninformative set. To enable scalable inference of iMAP to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP both in terms of high association mapping power and in terms of accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. iMAP is freely available at www.xzlab.org/software.html.


2019 ◽  
Vol 5 (1) ◽  
Author(s):  
Benazir Rowe ◽  
Xiangning Chen ◽  
Zuoheng Wang ◽  
Jingchun Chen ◽  
Amei Amei

Genome-wide association studies (GWAS) have identified over 100 loci associated with schizophrenia. Most of these studies test genetic variants for association one at a time. In this study, we performed a GWAS of the Molecular Genetics of Schizophrenia (MGS) dataset with 5334 subjects using the multivariate Bayesian variable selection (BVS) method Posterior Inference via Model Averaging and Subset Selection (piMASS) and compared our results with the previous univariate analysis of the MGS dataset. We showed that piMASS can improve the power of detecting schizophrenia-associated SNPs, potentially leading to new discoveries from existing data without increasing the sample size. We tested SNPs in groups to allow for local additive effects and used a permutation test to determine statistical significance in order to compare our results with the univariate method. The previous univariate analysis of the MGS dataset revealed no genome-wide significant loci. Using the same dataset, we identified a single region that exceeded genome-wide significance. The result was replicated using an independent Swedish Schizophrenia Case–Control Study (SSCCS) dataset. Based on the SZGR 2.0 database, we found 63 SNPs from the best-performing regions that are mapped to 27 genes known to be associated with schizophrenia. Overall, we demonstrated that piMASS could discover association signals that otherwise would need a much larger sample size. Our study has the important implication that reanalyzing published datasets with BVS methods like piMASS might have more power to discover new risk variants for many diseases without new sample collection, ascertainment, and genotyping.
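The idea of testing SNPs in groups and calibrating significance by permutation can be illustrated with a small sketch. It does not use piMASS or its posterior quantities: the region statistic here (the sum of squared SNP-phenotype correlations) is a stand-in chosen for simplicity, and the function names, simulated genotypes, and number of permutations are all assumptions.

```python
import numpy as np

def region_statistic(G, y):
    """Stand-in region statistic: sum of squared SNP-phenotype correlations."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    r = (Gc.T @ yc) / np.sqrt((Gc ** 2).sum(axis=0) * (yc ** 2).sum())
    return float((r ** 2).sum())

def permutation_pvalue(G, y, n_perm=999, seed=0):
    """Permutation p-value: shuffle phenotypes to break any SNP-phenotype link."""
    rng = np.random.default_rng(seed)
    observed = region_statistic(G, y)
    exceed = sum(
        region_statistic(G, rng.permutation(y)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive
    return (1 + exceed) / (1 + n_perm)

# Simulated region (assumption): 300 subjects, 5 SNPs, SNP 2 affects case status.
rng = np.random.default_rng(1)
n_subjects, n_snps = 300, 5
G = rng.binomial(2, 0.3, size=(n_subjects, n_snps)).astype(float)
liability = 1.0 * G[:, 2] + rng.normal(size=n_subjects)
status = (liability > np.median(liability)).astype(float)        # associated phenotype
null_status = rng.integers(0, 2, size=n_subjects).astype(float)  # unrelated phenotype

p_assoc = permutation_pvalue(G, status)
p_null = permutation_pvalue(G, null_status)
```

Because the phenotype is permuted as a whole, correlation among SNPs within the region is preserved under the null, which is the usual reason for preferring permutation over an analytic threshold for grouped tests.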


Circulation ◽  
2008 ◽  
Vol 118 (suppl_18) ◽  
Author(s):  
Peter Langfelder ◽  
Margarete Mehrabian ◽  
Eric E Schadt ◽  
Aldons J Lusis ◽  
Steve Horvath

The genetic and environmental factors contributing to HDL-cholesterol levels are highly complex. For example, a recent meta-analysis of three genome-wide association studies (GWAS), consisting of over 9000 individuals, revealed several loci, but altogether these explained less than 10% of HDL variation. Since HDL has a heritability of about 50%, there clearly must be many as yet unidentified factors. To better address this complexity, we have utilized integrative genomic approaches to relate common DNA variation to gene networks and HDL metabolism. We report a Weighted Gene Co-expression Network Analysis (WGCNA) of genome-wide expression data from a CAST X C57BL6/J F2 intercross. WGCNA is a systems-based gene expression analysis and gene screening method. It utilizes co-expression patterns among genes to identify gene modules (groups of highly co-expressed genes) significantly associated with a clinical trait, in this case plasma HDL levels. Co-expression modules may represent cellular processes and interacting pathways that provide a bridge between individual genes and a systems-level view of the organism. A module-centric analysis effectively alleviates the multiple testing problems inherent in microarray data analysis and can be considered a biologically motivated data-reduction scheme. Using data from liver and adipose tissues, we have identified several modules strongly associated with plasma HDL levels (p-values ranging from below 1e-20 to 1e-5). Gene ontology and functional enrichment analysis indicate that these modules are indeed biologically meaningful. The modules contain variants of several genes under loci that were recently implicated by three GWA studies: liver modules include GCKR, ANGPTL4, ABCA3, APOA1, and APOA4, while the adipose modules include ABCA6, ANGPTL11 and 12, MMAB, MLXIPL, SORT1, PBX4, PLTP, and APOL6. Thus, our study also serves to help identify likely candidates from GWAS.
In conclusion, applying WGCNA methods reveals modules that are biologically meaningful, statistically significant, and enriched for genes and pathways related to HDL metabolism and transport.
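Two building blocks of the module-centric analysis described above can be sketched as follows: a soft-thresholded co-expression adjacency and a module summary (eigengene) that is correlated with the trait. This sketch does not call the WGCNA R package and omits module detection altogether; module membership is taken as known in the simulation. The function names, the soft-threshold power of 6, and the simulated expression data are assumptions.

```python
import numpy as np

def soft_threshold_adjacency(expr, power=6):
    """Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** power (expr: samples x genes)."""
    corr = np.corrcoef(expr, rowvar=False)
    return np.abs(corr) ** power

def module_eigengene(expr_module):
    """First left singular vector of the standardized module expression matrix."""
    z = (expr_module - expr_module.mean(axis=0)) / expr_module.std(axis=0)
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, 0]

# Simulated data (assumption): one latent factor drives 10 module genes and the trait;
# 20 background genes are pure noise.
rng = np.random.default_rng(2)
n_samples = 100
factor = rng.normal(size=n_samples)
module = factor[:, None] + 0.5 * rng.normal(size=(n_samples, 10))
background = rng.normal(size=(n_samples, 20))
expr = np.hstack([module, background])
trait = factor + 0.5 * rng.normal(size=n_samples)

adj = soft_threshold_adjacency(expr)
# Mean off-diagonal adjacency inside the module versus among background genes
within = adj[:10, :10][~np.eye(10, dtype=bool)].mean()
outside = adj[10:, 10:][~np.eye(20, dtype=bool)].mean()

eigengene = module_eigengene(expr[:, :10])
# The sign of a singular vector is arbitrary, so compare absolute correlation
trait_corr = abs(np.corrcoef(eigengene, trait)[0, 1])
```

Testing one eigengene per module against the trait, rather than every gene separately, is what reduces the multiple-testing burden mentioned in the abstract.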

