A group bridge approach for variable selection

Biometrika ◽  
2009 ◽  
Vol 96 (2) ◽  
pp. 339-355 ◽  
Author(s):  
Jian Huang ◽  
Shuangge Ma ◽  
Huiliang Xie ◽  
Cun-Hui Zhang

Abstract In multiple regression problems when covariates can be naturally grouped, it is important to carry out feature selection at the group and within-group individual variable levels simultaneously. The existing methods, including the lasso and group lasso, are designed for either variable selection or group selection, but not for both. We propose a group bridge approach that is capable of simultaneous selection at both the group and within-group individual variable levels. The proposed approach is a penalized regularization method that uses a specially designed group bridge penalty. It has the oracle group selection property, in that it can correctly select important groups with probability converging to one. In contrast, the group lasso and group least angle regression methods in general do not possess such an oracle property in group selection. Simulation studies indicate that the group bridge has superior performance in group and individual variable selection relative to several existing methods.
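To make the idea of a group-level penalty that can still zero out individual coefficients concrete, here is a minimal sketch that evaluates a bridge-type group penalty of the form lam * sum_j c_j * (L1 norm of group j) ** gamma with 0 < gamma < 1. The abstract does not state the formula, so the function name, the default gamma = 0.5, and the unit group weights are assumptions made for illustration only.

```python
def group_bridge_penalty(beta, groups, lam, gamma=0.5, weights=None):
    """Evaluate lam * sum_j c_j * (sum_{k in A_j} |beta_k|) ** gamma.

    beta    : list of coefficients
    groups  : list of index lists A_j (groups may overlap)
    weights : group weights c_j (assumed to default to 1.0 each)
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma is assumed to lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(groups)
    total = 0.0
    for c_j, idx in zip(weights, groups):
        # L1 norm of the coefficients belonging to this group
        l1_norm = sum(abs(beta[k]) for k in idx)
        total += c_j * l1_norm ** gamma
    return lam * total


beta = [0.0, 0.0, 4.0, 0.0, 5.0]
groups = [[0, 1], [2, 3, 4]]
print(group_bridge_penalty(beta, groups, lam=2.0))
```

In this toy input the first group contributes nothing because all of its coefficients are zero, while the second group keeps two nonzero coefficients and one zero, which is the kind of two-level sparsity the abstract describes.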

2019 ◽  
Vol 39 (9) ◽  
pp. 0930002
Author(s):  
李冠稳 Guanwen Li ◽  
高小红 Xiaohong Gao ◽  
肖能文 Nengwen Xiao ◽  
肖云飞 Yunfei Xiao

2021 ◽  
Vol 1 (1) ◽  
pp. 41-55
Author(s):  
Kadriye Hilal Topal

The quality of a country's education is crucial to its competitiveness in the developing world. International tests are organized at regular intervals to measure the quality of education and to rank countries. The surveys accompanying these examinations provide a large number of variables that may affect test scores, including family, teacher, school, course equipment, and information and communication technologies. The important question is which variables are relevant to students' achievement in these tests. We investigated the barriers to the mathematics success of Turkish students in the TIMSS exam and compared their status with that of Singaporean students, who ranked at the top in the exam. To this end, we applied the adaptive elastic net, a regularized regression method, to the dataset and compared prediction accuracy at three alpha levels (0.1, 0.5, 0.9) to determine the model that combines strong variable selection with optimal prediction. The adaptive elastic net with alpha = 0.9 was selected as superior to the others. The findings suggest that a technology-oriented education system can support the success of students in Turkey and in countries with similar experiences in international tests.
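The role of the alpha level can be illustrated with a small sketch of an adaptive elastic net penalty. It assumes the glmnet-style parameterization, in which alpha mixes a weighted L1 term with a ridge term; the abstract does not say which parameterization or which adaptive weights were used, so the function name, the weights, and the coefficient values below are hypothetical.

```python
def adaptive_enet_penalty(beta, weights, lam, alpha):
    """lam * (alpha * sum_j w_j * |beta_j| + (1 - alpha) / 2 * sum_j beta_j ** 2).

    alpha = 1 gives a weighted (adaptive) lasso penalty, alpha = 0 a ridge penalty.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1_part = sum(w * abs(b) for w, b in zip(weights, beta))
    l2_part = sum(b * b for b in beta)
    return lam * (alpha * l1_part + 0.5 * (1.0 - alpha) * l2_part)


beta = [2.0, 0.0, -1.0]      # hypothetical coefficients
weights = [0.5, 1.0, 2.0]    # hypothetical adaptive weights
for alpha in (0.1, 0.5, 0.9):
    print(alpha, adaptive_enet_penalty(beta, weights, lam=1.0, alpha=alpha))
```

Under this parameterization, a larger alpha such as 0.9 puts most of the penalty on the L1 term and therefore favors sparser models.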


2019 ◽  
Author(s):  
Junyang Qian ◽  
Yosuke Tanigawa ◽  
Wenfei Du ◽  
Matthew Aguirre ◽  
Chris Chang ◽  
...  

Abstract The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996) has, since its first proposal in statistics, proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.
Author Summary With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities arise for researchers to further uncover the complex landscape of human genetics. One major direction that attracts long-standing interest is the investigation of the relationships between genotypes and phenotypes. This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on the genotypic information. Genome-wide association studies (GWAS) are a very powerful and widely used framework for the former task, having produced a number of very impactful discoveries. However, when it comes to the latter, their performance is fairly limited by their univariate nature. To address this, multiple regression methods have been suggested to fill the gap. That said, challenges emerge as both the dimension and the size of datasets grow. In this paper, we present a novel computational framework that enables us to efficiently compute the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and thus perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package snpnet that extends the glmnet package in R and optimizes for large phenotype-genotype data. On the UK Biobank, we observe improved prediction performance on height, body mass index (BMI), asthma and high cholesterol by the lasso over other univariate and multiple regression methods. That said, the scope of our approach goes beyond genetic studies. It can be applied to general sparse regression problems and build scalable solutions for a variety of distribution families based on existing solvers.
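The idea of fitting the lasso on a screened subset of variables and then verifying the result against the full data can be sketched as follows. This is not the snpnet implementation: it is a minimal illustration, assuming the lasso objective (1/2n)·||y − Xβ||² + λ·||β||₁, of two generic ingredients, namely screening by gradient magnitude and a KKT check on the excluded variables. All function names and the toy data are hypothetical.

```python
def gradient_magnitudes(X, residual):
    """Return |x_j^T r| / n for every column j of the row-major matrix X."""
    n, p = len(X), len(X[0])
    return [abs(sum(X[i][j] * residual[i] for i in range(n))) / n
            for j in range(p)]


def screen(X, residual, active, batch_size):
    """Add the batch_size excluded variables with the largest gradient magnitude."""
    grads = gradient_magnitudes(X, residual)
    candidates = sorted((j for j in range(len(grads)) if j not in active),
                        key=lambda j: grads[j], reverse=True)
    return sorted(set(active) | set(candidates[:batch_size]))


def kkt_violations(X, residual, working_set, lam, tol=1e-9):
    """List excluded variables whose zero coefficient violates |x_j^T r| / n <= lam."""
    grads = gradient_magnitudes(X, residual)
    return [j for j in range(len(grads))
            if j not in working_set and grads[j] > lam + tol]


# Toy data: 4 samples, 3 variables; the residual would normally come from
# whatever lasso solver was run on the current working set.
X = [[1, 0, 2], [0, 1, 0], [-1, 0, 1], [0, -1, 0]]
residual = [3, 1, -3, -1]
working_set = screen(X, residual, active=[], batch_size=1)
print(working_set)
print(kkt_violations(X, residual, working_set, lam=0.6))
```

If the list of violations is empty, the subset solution is also a solution of the full problem at that value of lam; otherwise the violating variables would be added to the working set and the fit repeated.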


2020 ◽  
Vol 10 (12) ◽  
pp. 4439-4448
Author(s):  
Zigui Wang ◽  
Deborah Chapman ◽  
Gota Morota ◽  
Hao Cheng

Bayesian regression methods that incorporate different mixture priors for marker effects are used in multi-trait genomic prediction. These methods can also be extended to genome-wide association studies (GWAS). In multiple-trait GWAS, incorporating the underlying causal structures among traits is essential for comprehensively understanding the relationship between genotypes and traits of interest. Therefore, we develop a GWAS methodology, SEM-Bayesian alphabet, which, by applying the structural equation model (SEM), can be used to incorporate causal structures into multi-trait Bayesian regression methods. SEM-Bayesian alphabet provides a more comprehensive understanding of the genotype-phenotype mapping than multi-trait GWAS by performing GWAS based on indirect, direct and overall marker effects. The superior performance of SEM-Bayesian alphabet was demonstrated by comparing its GWAS results with other similar multi-trait GWAS methods on real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.
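The distinction between direct, indirect and overall marker effects can be illustrated with a small sketch. It assumes a recursive structural equation model y = Λy + αx + e, in which traits are ordered so that Λ is strictly lower triangular; the overall effect then follows from forward substitution. This is a generic SEM calculation with hypothetical names and numbers, not the JWAS routine.

```python
def overall_effects(direct, Lambda):
    """Overall marker effects (I - Lambda)^{-1} * direct, by forward substitution.

    direct : direct marker effect on each trait
    Lambda : trait-on-trait structural coefficients, assumed strictly lower
             triangular (trait i depends only on traits 0..i-1)
    """
    overall = [0.0] * len(direct)
    for i in range(len(direct)):
        # effect transmitted through upstream traits plus the direct effect
        overall[i] = direct[i] + sum(Lambda[i][k] * overall[k] for k in range(i))
    return overall


def indirect_effects(direct, Lambda):
    """Indirect effect = overall effect minus direct effect, per trait."""
    return [o - d for o, d in zip(overall_effects(direct, Lambda), direct)]


direct = [1.0, 0.5, 0.0]           # hypothetical direct effects on 3 traits
Lambda = [[0.0, 0.0, 0.0],
          [0.5, 0.0, 0.0],         # trait 1 depends on trait 0
          [0.0, 2.0, 0.0]]         # trait 2 depends on trait 1
print(overall_effects(direct, Lambda))
print(indirect_effects(direct, Lambda))
```

In this toy case the marker has no direct effect on the third trait, yet its overall effect there is nonzero because it is mediated by the upstream traits.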


Author(s):  
I. Tsamardinos ◽  
G. Borboudakis ◽  
E. G. Christodoulou ◽  
O. D. Røe

The chemosensitivity of tumours to specific drugs can be predicted based on molecular quantities, such as gene expressions, miRNA expressions, and protein concentrations. This finding is important for improving drug efficacy and personalizing drug use. In this paper, the authors present an analysis strategy that, compared to prior work, retains more information in the data for analysis and may lead to improved chemosensitivity prediction. The authors apply improved methods for estimating the GI50 value of a drug (an indicator of the response to the drug), regression methods for constructing predictive models of the GI50 value, advanced variable selection techniques, such as MMPC, and a multi-task variable selection technique for identifying a small-size signature that is simultaneously predictive for several drugs and cell lines. The methods are applied on gene expression, miRNA expression, and proteomics data from 53 tumour cell lines after treatment with 120 drugs, obtained from the National Cancer Institute databases. A biological interpretation and discussion of the results is presented for the most clinically important subset of 14 drugs.


Author(s):  
Jiamin Zhao ◽  
Yang Yu ◽  
Xu Wang ◽  
Shihan Ma ◽  
Xinjun Sheng ◽  
...  

Abstract Objective. A musculoskeletal model (MM) driven by electromyography (EMG) signals has been identified as a promising approach to predicting human motions in the control of prostheses and robots. However, muscle excitations in MMs are generally derived from the EMG signals of the sensor placed over the targeted muscle, which is inconsistent with the fact that, because of signal crosstalk, the signals of a sensor in practice originate from multiple muscles. To identify more accurate muscle excitations for MMs in the presence of crosstalk, we proposed a novel excitation-extracting method inspired by muscle synergy for simultaneously estimating hand and wrist movements. Approach. Muscle excitations were first extracted using a two-step muscle synergy-derived method. Specifically, we calculated a subject-specific muscle weighting matrix and the corresponding profiles according to the contributions of different muscles to movements derived from the synergistic motion relation. Then, the improved excitations were used to simultaneously estimate hand and wrist movements through musculoskeletal modeling. Moreover, an offline comparison among the proposed method, the traditional MM and regression methods, and an online test of the proposed method were conducted. Main results. The offline experiments demonstrated that the proposed approach outperformed the EMG envelope-driven MM and three regression models, with higher R and lower NRMSE. Furthermore, the comparison of the excitations of the two MMs validated the effectiveness of the proposed approach in extracting muscle excitations in the presence of crosstalk. The online test further indicated the superior performance of the proposed method over the MM driven by EMG envelopes. Significance. The proposed excitation-extracting method identified more accurate neural commands for MMs, providing a promising approach in rehabilitation and robot control to model the transformation from surface EMG to joint kinematics.
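The two reported metrics can be computed as in the sketch below. The abstract does not define them, so this assumes R is the Pearson correlation coefficient between estimated and measured joint angles and that NRMSE is the root-mean-square error normalized by the range of the measured signal; other normalizations (for example by the mean) are also in use. Function names and sample values are hypothetical.

```python
import math


def pearson_r(estimated, measured):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(measured)
    mean_e = sum(estimated) / n
    mean_m = sum(measured) / n
    cov = sum((e - mean_e) * (m - mean_m) for e, m in zip(estimated, measured))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_e * var_m)


def nrmse(estimated, measured):
    """RMSE divided by the range of the measured signal (assumed normalization)."""
    n = len(measured)
    rmse = math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)
    return rmse / (max(measured) - min(measured))


measured = [0.0, 1.0, 2.0, 3.0]     # hypothetical joint angles
estimated = [1.0, 2.0, 3.0, 4.0]    # same shape, constant offset of 1
print(pearson_r(estimated, measured), nrmse(estimated, measured))
```

The toy example shows why both metrics are reported together: a constant offset leaves the correlation essentially perfect while the normalized error is clearly nonzero.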


2013 ◽  
Vol 12 ◽  
pp. CIN.S10212 ◽  
Author(s):  
Lingkang Huang ◽  
Hao Helen Zhang ◽  
Zhao-Bang Zeng ◽  
Pierre R. Bushel

Background Microarray techniques provide promising tools for cancer diagnosis using gene expression profiles. However, molecular diagnosis based on high-throughput platforms presents great challenges due to the overwhelming number of variables versus the small sample size and the complex nature of multi-type tumors. Support vector machines (SVMs) have shown superior performance in cancer classification due to their ability to handle high dimensional low sample size data. The multi-class SVM algorithm of Crammer and Singer provides a natural framework for multi-class learning. Despite its effective performance, the procedure utilizes all variables without selection. In this paper, we propose to improve the procedure by imposing shrinkage penalties in learning to enforce solution sparsity. Results The original multi-class SVM of Crammer and Singer is effective for multi-class classification but does not conduct variable selection. We improved the method by introducing soft-thresholding type penalties to incorporate variable selection into multi-class classification for high dimensional data. The new methods were applied to simulated data and two cancer gene expression data sets. The results demonstrate that the new methods can select a small number of genes for building accurate multi-class classification rules. Furthermore, the important genes selected by the methods overlap significantly, suggesting general agreement among different variable selection schemes. Conclusions High accuracy and sparsity make the new methods attractive for cancer diagnostics with gene expression data and defining targets of therapeutic intervention. Availability The source MATLAB code is available from http://math.arizona.edu/~hzhang/software.html.
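To show how a soft-thresholding type penalty produces sparsity, the sketch below implements the standard soft-thresholding operator associated with an L1 penalty, S(z, λ) = sign(z)·max(|z| − λ, 0). The specific penalties used in the paper may differ; the function name and the sample weights are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam and set it exactly to zero if |z| <= lam."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical per-gene weights before shrinkage
weights = [1.0, -0.125, 0.5, -1.5]
shrunk = [soft_threshold(w, 0.25) for w in weights]
print(shrunk)
# Genes whose weight was set exactly to zero are dropped from the model
print([i for i, w in enumerate(shrunk) if w != 0.0])
```

Small weights are set exactly to zero rather than merely reduced, which is what turns shrinkage into variable selection.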

