stability selection

2021 ◽  
Author(s):  
Asma Nouira ◽  
Chloe-Agathe Azencott

Genome-Wide Association Studies, or GWAS, aim at finding Single Nucleotide Polymorphisms (SNPs) that are associated with a phenotype of interest. GWAS are known to suffer from the large dimensionality of the data with respect to the number of available samples. Other limiting factors include the dependency between SNPs, due to linkage disequilibrium (LD), and the need to account for population structure, that is to say, confounding due to genetic ancestry. We propose an efficient approach for the multivariate analysis of admixed GWAS data based on a multitask group Lasso formulation. Each task corresponds to a subpopulation of the data, and each group to an LD-block. This formulation alleviates the curse of dimensionality, and makes it possible to identify disease LD-blocks shared across populations/tasks, as well as some that are specific to one population/task. In addition, we use stability selection to increase the robustness of our approach. Finally, gap safe screening rules speed up computations enough that our method can run at a genome-wide scale. To our knowledge, this is the first framework for GWAS on admixed populations combining feature selection at the LD-groups level, a multitask approach to address population structure, stability selection, and safe screening rules. We show that our approach outperforms state-of-the-art methods on both a simulated and a real-world cancer dataset.
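Group Lasso solvers commonly rely on a block soft-thresholding step that shrinks or zeroes a whole group of coefficients at once, which is what lets an entire LD-block be kept or discarded together. The following is a minimal sketch of that step only, not the authors' implementation; the function name `block_soft_threshold` and the flat-list representation of a block (all coefficients of one LD-block across tasks) are assumptions made for illustration.

```python
import math


def block_soft_threshold(block, threshold):
    """Proximal operator of threshold * ||block||_2 (group Lasso penalty).

    `block` is a flat list of coefficients belonging to one group
    (assumed here: one LD-block, concatenated across tasks).
    """
    norm = math.sqrt(sum(v * v for v in block))
    # A group whose norm does not exceed the threshold is removed entirely.
    if norm <= threshold:
        return [0.0] * len(block)
    # Otherwise every coefficient in the group is shrunk by the same factor.
    scale = 1.0 - threshold / norm
    return [scale * v for v in block]


weak_block = block_soft_threshold([0.3, 0.4], 1.0)    # norm 0.5, dropped
strong_block = block_soft_threshold([3.0, 4.0], 1.0)  # norm 5.0, shrunk
```

Because the shrinkage factor is shared, the group enters or leaves the model as a unit, which is the behaviour the abstract relies on for selection at the LD-block level.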


2021 ◽  
Vol 12 ◽  
Author(s):  
Lin Yuan ◽  
Tao Sun ◽  
Jing Zhao ◽  
Zhen Shen

Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides us with an opportunity to design a new machine learning framework to predict potential disease-related CNVs. In this paper, we developed a novel machine learning approach, namely, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels, and gene data. CNVs, genes, and diseases are connected through edges and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with an L1 penalty (LLR) to detect genes related to disease. We added a stability selection strategy, which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find the top D path associations and important CNVs. The experimental results on both simulation and prostate cancer data show that IHI-BMLLR is significantly better than two state-of-the-art CNV detection methods (i.e., CCRET and DPtest) under false-positive control. Furthermore, we applied IHI-BMLLR to prostate cancer data and found significant path associations. Three new cancer-related genes were discovered in the paths, and these genes need to be verified by biological research in the future.
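For readers unfamiliar with the correlation measure named above, here is a minimal sketch of the standard biweight midcorrelation. It is an assumption that this textbook form is close to what the paper builds on: the abstract describes a "self-adaptive" variant whose details are not given here, so that adaptation is not reproduced. The function names are hypothetical.

```python
import math
from statistics import median


def _biweighted_deviations(values):
    """Deviations from the median, down-weighted by Tukey's biweight."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; biweight midcorrelation is undefined")
    out = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Points further than 9 MADs from the median get zero weight.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        out.append((v - med) * w)
    return out


def biweight_midcorrelation(x, y):
    """Standard (non-adaptive) biweight midcorrelation of two sequences."""
    a = _biweighted_deviations(x)
    b = _biweighted_deviations(y)
    numerator = sum(p * q for p, q in zip(a, b))
    denominator = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return numerator / denominator


cnv = [1, 2, 3, 4, 5, 6, 7]
gene = [1, 2, 3, 4, 5, 6, -100]   # last value is a gross outlier
robust_r = biweight_midcorrelation(cnv, gene)
```

In this toy example the outlying value receives zero weight, so the estimate stays positive even though one observation strongly contradicts the trend.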


Circulation ◽  
2021 ◽  
Vol 143 (Suppl_1) ◽  
Author(s):  
Joshua Elliott ◽  
Barbara Bodinier ◽  
Matthew Whitaker ◽  
Ioanna Tzoulaki ◽  
Paul Elliott ◽  
...  

Introduction: Studies of risk factors for severe/fatal COVID-19 to date may not have identified the optimal set of informative predictors. Hypothesis: Use of penalized regression with stability analysis may identify new, sparse sets of risk factors jointly associated with COVID-19 mortality. Methods: We investigated demographic, social, lifestyle, biological (lipids, cystatin C, vitamin D), medical (comorbidities, medications) and air pollution data from UK Biobank (N=473,574) in relation to linked COVID-19 mortality, and compared with non-COVID-19 mortality. We used penalized regression models (LASSO) with stability analysis (80% selection threshold from 1,000 models with 80% subsampling) to identify a sparse set of variables associated with COVID-19 mortality. Results: Among 43 variables considered by LASSO stability selection, cardiovascular disease, hypertension, diabetes, cystatin C, age, male sex and Black ethnicity were jointly predictive of COVID-19 mortality risk at 80% selection threshold (Figure). Of these, Black ethnicity and hypertension contributed to COVID-19 but not non-COVID-19 mortality. Conclusions: Use of LASSO stability selection identified a sparse set of predictors for COVID-19 mortality including cardiovascular disease, hypertension, diabetes and cystatin C, a marker of renal function that has also been implicated in atherogenesis and inflammation. These results indicate the importance of cardiometabolic comorbidities as predisposing factors for COVID-19 mortality. Hypertension was differentially highly selected for risk of COVID-19 mortality, suggesting the need for continued vigilance with good blood pressure control during the pandemic.
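The stability analysis described in the Methods (repeated LASSO fits on 80% subsamples, keeping variables selected in at least 80% of fits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: it swaps in a linear LASSO solved by coordinate descent where the study would need a model suited to a mortality outcome, it uses a fixed penalty `lam`, it assumes roughly centered data (no intercept), and it runs far fewer than 1,000 subsamples to stay fast. All function names and the toy data are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def stability_selection(X, y, lam, n_subsamples=100, fraction=0.8,
                        threshold=0.8, seed=0):
    """Return per-variable selection frequencies and the stable set."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    size = int(fraction * n)
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), size)  # subsample without replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return freqs, selected


def make_toy_data(n=100, p=8, seed=1):
    """Toy data: only the first two of p variables drive the outcome."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [2.0 * r[0] - 1.5 * r[1] + rng.gauss(0.0, 0.5) for r in X]
    return X, y


X, y = make_toy_data()
freqs, selected = stability_selection(X, y, lam=0.3, n_subsamples=30)
print(selected)
```

The selection threshold acts on frequencies across subsamples rather than on any single fit, which is why a variable that enters the model only occasionally is not reported.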


Circulation ◽  
2021 ◽  
Vol 143 (Suppl_1) ◽  
Author(s):  
Joshua Elliott ◽  
Matthew Whitaker ◽  
Barbara Bodinier ◽  
Paul Elliott ◽  
Ioanna Tzoulaki ◽  
...  

Introduction: Variable selection methods can provide an unbiased means of identifying informative predictors but have rarely been applied to CVD risk prediction. Hypothesis: Additional variables beyond those in pooled cohort equations may improve CVD risk prediction. Methods: Use of two complementary variable selection methods (LASSO stability selection, which is parametric, and survival random forests, which are non-parametric) to identify jointly informative sets of predictors for CVD risk and rank them in order of predictive accuracy. We used a prospective cohort (UK Biobank) of 304,839 participants aged 40-69 years at enrollment (2006–2010) without prior CVD, with follow-up to March 2017. Variables comprised those in pooled cohort equations with additional biochemistry and hematology data and polygenic risk scores for CVD. Outcomes were CVD hospitalization, procedure/operation or mortality. Data were sex-stratified and divided into independent variable selection (40%), training (30%) and test (30%) sets. Variable selection via penalized (LASSO) Cox regression with stability analysis. Variables ranked according to mean change in C statistic after variable permutation in survival random forests. Results: Mean age 55.9 years; 10,267 CVD events (6,277 men [59.0%]), median 8.1 years follow-up. The Figure summarizes results from LASSO stability selection. Jointly informative predictors for both men and women were cystatin C, apolipoprotein B, family history of coronary artery disease and polygenic risk score in addition to age, systolic blood pressure, antihypertensive use and current smoking used in pooled cohort equations. Other than variables already included in pooled cohort equations, cystatin C and apolipoprotein B ranked highest in random forests for men and for women. Conclusions: Use of two complementary data-driven variable selection methods identified variables more highly selected for CVD prediction beyond those included in pooled cohort equations.
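The ranking step in the Methods (mean change in C statistic after permuting a variable) can be illustrated with a small sketch. It assumes Harrell's concordance index for right-censored data and a generic risk function standing in for the fitted survival random forest, which is not reproduced here; `concordance_index`, `permutation_importance`, and `risk_fn` are hypothetical names, and the toy data are invented for illustration.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def permutation_importance(rows, times, events, risk_fn, column,
                           n_repeats=20, seed=0):
    """Mean drop in C after shuffling one column of the predictors."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [risk_fn(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        c = concordance_index(times, events, [risk_fn(r) for r in permuted])
        drops.append(baseline - c)
    return sum(drops) / n_repeats


# Toy example: risk depends only on column 0; column 1 is irrelevant.
rows = [[float(i), float(i % 3)] for i in range(30)]
times = [100.0 - r[0] for r in rows]   # higher column 0, earlier event
events = [1] * len(rows)


def risk_fn(row):
    return row[0]


importance_informative = permutation_importance(rows, times, events, risk_fn, 0)
importance_irrelevant = permutation_importance(rows, times, events, risk_fn, 1)
```

A variable the model ignores yields no change in C when shuffled, whereas shuffling an informative variable pulls C toward 0.5, so the mean drop gives a natural ranking.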


2021 ◽  
Author(s):  
Lin Yuan ◽  
Tao Sun ◽  
Jing Zhao ◽  
Zhen Shen

Abstract Background: Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides us with an opportunity to design a new machine learning framework to predict potential disease-related CNVs. Results: In this paper, we developed a novel machine learning approach, namely IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels, and gene data. CNVs, genes, and diseases are connected through edges and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with an L1 penalty (LLR) to detect genes related to disease. We added a stability selection strategy, which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find the top D path associations and important CNVs. Conclusions: Compared with state-of-the-art methods, IHI-BMLLR discovers CNV-disease path associations by integrating the analysis of CNV, gene expression, and disease label data with a stability selection strategy and a weighted path search algorithm, thereby mining more information from the data sets and improving the accuracy of the obtained CNVs. The experimental results on both simulation and prostate cancer data show that IHI-BMLLR is significantly better than two state-of-the-art CNV detection methods (i.e., CCRET and DPtest) under false-positive control. Furthermore, we applied IHI-BMLLR to prostate cancer data and found significant path associations. Three new cancer-related genes were discovered in the paths, and these genes need to be verified by biological research in the future.


2020 ◽  
Vol 124 ◽  
pp. 103959 ◽  
Author(s):  
Kang K. Yan ◽  
Xiaofei Wang ◽  
Wendy W.T. Lam ◽  
Varut Vardhanabhuti ◽  
Anne W.M. Lee ◽  
...  
