Exploring Pathway-Based Group Lasso for Cancer Survival Analysis: A Special Case of Multi-Task Learning

2021 ◽  
Vol 12 ◽  
Author(s):  
Gabriela Malenová ◽  
Daniel Rowson ◽  
Valentina Boeva

Motivation: Cox proportional hazards models are widely used in the study of cancer survival. However, these models are often challenged by the large number of features and small sample sizes of cancer data sets. While this issue can be partially addressed by applying regularization techniques such as lasso, the models still suffer from unsatisfactory predictive power and low stability. Methods: Here, we investigated two methods to improve survival models. Firstly, we leveraged the biological knowledge that groups of genes act together in pathways and regularized at both the group and gene level using a latent group lasso penalty term. Secondly, we designed and applied a multi-task learning penalty that allowed us to leverage the relationship between survival models for different cancers. Results: We observed modest improvements over the simple lasso model with the inclusion of a latent group lasso penalty for six of the 16 cancer types tested. The addition of a multi-task penalty, which penalized coefficients in pairs of cancers for diverging too greatly, significantly improved accuracy for a single cancer, lung squamous cell carcinoma, while having minimal effect on other cancer types. Conclusion: While the use of pathway information and multi-tasking shows some promise, these methods do not provide a substantial improvement when compared with standard methods.
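
The abstract above names three kinds of penalty: a gene-level lasso term, a pathway-level group term, and a multi-task term that discourages coefficients from diverging between pairs of cancers. The following is a minimal sketch of what such terms can look like, not the authors' implementation. It assumes non-overlapping gene groups (in which case the latent group lasso reduces to the ordinary group lasso), a square-root group-size weight, and a squared Euclidean distance for the pairwise term. All function names, weights, and the exact form of the multi-task term are illustrative assumptions.

```python
import numpy as np

def lasso_penalty(beta, lam):
    """Gene-level L1 penalty: lam * sum_j |beta_j|."""
    return lam * np.abs(beta).sum()

def group_lasso_penalty(beta, groups, lam):
    """Pathway-level penalty for NON-overlapping groups (assumption).

    Each group g contributes sqrt(|g|) * ||beta_g||_2, so whole pathways
    are pushed towards zero together.
    """
    return lam * sum(
        np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups
    )

def pairwise_divergence_penalty(betas, pairs, mu):
    """Hypothetical multi-task term: mu * sum over cancer pairs of
    ||beta_a - beta_b||_2^2, which grows as paired models diverge."""
    return mu * sum(
        np.sum((betas[a] - betas[b]) ** 2) for a, b in pairs
    )

# Toy usage with two hypothetical cancer types and three genes.
betas = {
    "cancer_a": np.array([3.0, 4.0, 0.0]),
    "cancer_b": np.array([3.0, 3.0, 1.0]),
}
pathways = [[0, 1], [2]]  # genes 0 and 1 share a pathway
total = sum(
    lasso_penalty(b, 0.1) + group_lasso_penalty(b, pathways, 0.1)
    for b in betas.values()
) + pairwise_divergence_penalty(betas, [("cancer_a", "cancer_b")], 0.5)
```

In a full model this total would be added to the negative Cox partial log-likelihood of each cancer type before optimization; that step is omitted here.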

2013 ◽  
Vol 58 (2) ◽  
pp. 381-407 ◽  
Author(s):  
Silvia Villa ◽  
Lorenzo Rosasco ◽  
Sofia Mosci ◽  
Alessandro Verri

2020 ◽  
Vol 19 ◽  
pp. 117693512090739
Author(s):  
Sarah Samorodnitsky ◽  
Katherine A Hoadley ◽  
Eric F Lock

We built a novel Bayesian hierarchical survival model based on the somatic mutation profile of patients across 50 genes and 27 cancer types. The pan-cancer quality allows the model to “borrow” information across cancer types, motivated by the assumption that similar mutation profiles may have similar (but not necessarily identical) effects on survival across different tissues of origin or tumor types. The effect of a mutation at each gene was allowed to vary by cancer type, whereas the mean effect of each gene was shared across cancers. Within this framework, we considered 4 parametric survival models (normal, log-normal, exponential, and Weibull), and we compared their performance via a cross-validation approach in which we fit each model on training data and estimated the log-posterior predictive likelihood on test data. The log-normal model gave the best fit, and we investigated the partial effect of each gene on survival via a forward selection procedure. Through this we determined that mutations at TP53 and FAT4 were together the most useful for predicting patient survival. We validated the model via simulation to ensure that our algorithm for posterior computation gave nominal coverage rates. The code used for this analysis can be found at https://github.com/sarahsamorodnitsky/Pan-Cancer-Survival-Modeling.git, and the results are summarized at http://ericfrazerlock.com/surv_figs/SurvivalDisplay.html.


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241225
Author(s):  
Andre Goncalves ◽  
Braden Soper ◽  
Mari Nygård ◽  
Jan F. Nygård ◽  
Priyadip Ray ◽  
...  

Oncology is a highly siloed field of research in which sub-disciplinary specialization has limited the amount of information shared between researchers of distinct cancer types. This can be attributed to legitimate differences in the physiology and carcinogenesis of cancers affecting distinct anatomical sites. However, underlying processes that are shared across seemingly disparate cancers probably affect prognosis. The objective of the current study is to investigate whether multitask learning improves 5-year survival prediction for cancer patients by leveraging information across anatomically distinct HPV-related cancers. Data were obtained from the Surveillance, Epidemiology, and End Results (SEER) program database. The study cohort consisted of 29,768 primary cancer cases diagnosed in the United States between 2004 and 2015. Ten different cancer diagnoses were selected, all with a known association with HPV risk. In the analysis, the cancer diagnoses were categorized into three distinct topography groups of varying specificity. The most specific topography grouping consisted of the 10 original cancer diagnoses differentiated by the first two digits of the ICD-O-3 topography code. The second topography grouping consisted of cancer diagnoses categorized into six distinct organ groups. Finally, the third topography grouping consisted of just two groups, head-neck cancers and ano-genital cancers. The tasks were to predict 5-year survival for patients within the different topography groups using 14 predictive features selected from among the descriptive variables available in the SEER database.
The information from the predictive features was shared between tasks in three different ways, resulting in three distinct predictive models: 1) Information was not shared between patients assigned to different tasks (single task learning); 2) Information was shared between all patients, regardless of task (pooled model); 3) Only relevant information was shared between patients grouped to different tasks (multitask learning). Prediction performance was evaluated with Brier scores. All three models were evaluated against one another on each of the three distinct topography-defined tasks. The results showed that multitask classifiers achieved relative improvement for the majority of the scenarios studied compared to single task learning and pooled baseline methods. In this study, we have demonstrated that sharing information among anatomically distinct cancer types can lead to improved predictive survival models.
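
The abstract above evaluates prediction performance with Brier scores. As a minimal sketch of that metric for a binary outcome such as 5-year survival, the Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome, so lower is better. The function name and the toy numbers below are illustrative assumptions, not data from the study.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(p_pred, dtype=float)
    if y.shape != p.shape:
        raise ValueError("y_true and p_pred must have the same shape")
    return float(np.mean((p - y) ** 2))

# Toy usage: 1 = survived 5 years, 0 = did not (made-up values).
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.8, 0.3]
score = brier_score(outcomes, predicted)
```

A model that always predicts 0.5 scores 0.25 on any binary data set, which gives a rough reference point when comparing single-task, pooled, and multitask models on the same task.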


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Erik van Dijk ◽  
Tom van den Bosch ◽  
Kristiaan J. Lenos ◽  
Khalid El Makrini ◽  
Lisanne E. Nijman ◽  
...  

Survival rates of cancer patients vary widely within and between malignancies. While genetic aberrations are at the root of all cancers, individual genomic features cannot explain these distinct disease outcomes. In contrast, intra-tumour heterogeneity (ITH) has the potential to elucidate pan-cancer survival rates and the biology that drives cancer prognosis. Unfortunately, a comprehensive and effective framework to measure ITH across cancers is missing. Here, we introduce a scalable measure of chromosomal copy number heterogeneity (CNH) that predicts patient survival across cancers. We show that the level of ITH can be derived from a single-sample copy number profile. Using gene-expression data and live cell imaging we demonstrate that ongoing chromosomal instability underlies the observed heterogeneity. Analysing 11,534 primary cancer samples from 37 different malignancies, we find that copy number heterogeneity can be accurately deduced and predicts cancer survival across tissues of origin and stages of disease. Our results provide a unifying molecular explanation for the different survival rates observed between cancer types.


2010 ◽  
Vol 9 ◽  
pp. CIN.S4020 ◽  
Author(s):  
Chen Zhao ◽  
Michael L. Bittner ◽  
Robert S. Chapkin ◽  
Edward R. Dougherty

When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets. Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/


2016 ◽  
Author(s):  
Marta R. Hidalgo ◽  
Cankut Cubuk ◽  
Alicia Amadoz ◽  
Francisco Salavert ◽  
José Carbonell-Caballero ◽  
...  

Understanding the aspects of the cell functionality that account for disease or drug action mechanisms is a major challenge for precision medicine. Here we propose a new method that models cell signaling using biological knowledge on signal transduction. The method recodes individual gene expression values (and/or gene mutations) into accurate measurements of changes in the activity of signaling circuits, which ultimately constitute high-throughput estimations of cell functionalities caused by gene activity within the pathway. Moreover, such estimations can be obtained either at cohort-level, in case/control comparisons, or personalized for individual patients. The accuracy of the method is demonstrated in an extensive analysis involving 5640 patients from 12 different cancer types. Circuit activity measurements not only have a high diagnostic value but also can be related to relevant disease outcomes such as survival, and can be used to assess therapeutic interventions.


2017 ◽  
Author(s):  
Stefano Beretta ◽  
Mauro Castelli ◽  
Ivo Gonçalves ◽  
Ivan Merelli ◽  
Daniele Ramazzotti

Gene and protein networks are very important for modeling complex large-scale systems in molecular biology. Inferring or reverse-engineering such networks can be defined as the process of identifying gene/protein interactions from experimental data through computational analysis. However, this task is typically complicated by the enormous number of unknowns relative to a rather small sample size. Furthermore, when the goal is to study causal relationships within the network, tools capable of overcoming the limitations of correlation networks are required. In this work, we make use of Bayesian Graphical Models to attack this problem and, specifically, we perform a comparative study of different state-of-the-art heuristics, analyzing their performance in inferring the structure of the Bayesian Network from breast cancer data.


2021 ◽  
Author(s):  
Xin Chen ◽  
Qingrun Zhang ◽  
Thierry Chekouo

Background: DNA methylations in critical regions are highly involved in cancer pathogenesis and drug response. However, identifying causal methylations among a large number of potential polymorphic DNA methylation sites is challenging. This high-dimensional data brings two obstacles: first, many established statistical models do not scale to so many features; second, multiple testing and overfitting become serious problems. To this end, a method to quickly filter candidate sites to narrow down targets for downstream analyses is urgently needed. Methods: BACkPAy is a pre-screening Bayesian approach to detect biologically meaningful clusters of potential differential methylation levels with small sample sizes. BACkPAy prioritizes potentially important biomarkers by the Bayesian false discovery rate (FDR) approach. It filters out non-informative (i.e. non-differential) sites with flat methylation pattern levels across experimental conditions. In this work, we applied BACkPAy to a genome-wide methylation dataset with 3 tissue types, each containing 3 gastric cancer samples. We also applied LIMMA (Linear Models for Microarray and RNA-Seq Data) to compare its results with those obtained by BACkPAy. Then, Cox proportional hazards regression models were used to visualize prognostically significant markers with The Cancer Genome Atlas (TCGA) data for survival analysis. Results: Using BACkPAy, we identified 8 biologically meaningful clusters/groups of differential probes from the DNA methylation dataset. Using TCGA data, we also identified five prognostic genes (i.e. predictive of the progression of gastric cancer) that contain some differential methylation probes, whereas no significant results were identified using the Benjamini-Hochberg FDR in LIMMA. Conclusions: We showed the importance of using BACkPAy for the analysis of DNA methylation data with extremely small sample size in gastric cancer.
We revealed that RDH13, CLDN11, TMTC1, UCHL1 and FOXP2 can serve as predictive biomarkers for gastric cancer treatment and the promoter methylation level of these five genes in serum could have prognostic and diagnostic functions in gastric cancer patients.
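
The abstract above compares BACkPAy against the Benjamini-Hochberg FDR correction used with LIMMA. For readers unfamiliar with that baseline, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the BACkPAy method and not LIMMA's code; the function name and the toy p-values are illustrative assumptions.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # last sorted position passing
        reject[order[: k + 1]] = True
    return reject

# Toy usage with made-up p-values for four methylation probes.
mask = benjamini_hochberg([0.041, 0.001, 0.5, 0.008], alpha=0.05)
```

With very few samples per condition, raw p-values are rarely small enough to pass these thresholds across many probes, which is consistent with the abstract's report that this baseline identified no significant results.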


Thorax ◽  
2017 ◽  
Vol 73 (4) ◽  
pp. 339-349 ◽  
Author(s):  
Margreet Lüchtenborg ◽  
Eva J A Morris ◽  
Daniela Tataru ◽  
Victoria H Coupland ◽  
Andrew Smith ◽  
...  

Introduction: The International Cancer Benchmarking Partnership (ICBP) identified significant international differences in lung cancer survival. Differing levels of comorbid disease across ICBP countries have been suggested as a potential explanation of this variation but, to date, no studies have quantified its impact. This study investigated whether comparable, robust comorbidity scores can be derived from the different routine population-based cancer data sets available in the ICBP jurisdictions and, if so, to use them to quantify international variation in comorbidity and determine its influence on outcome. Methods: Linked population-based lung cancer registry and hospital discharge data sets were acquired from nine ICBP jurisdictions in Australia, Canada, Norway and the UK, providing a study population of 233 981 individuals. For each person in this cohort, Charlson, Elixhauser and inpatient bed day comorbidity scores were derived relating to the 4–36 months prior to their lung cancer diagnosis. The scores were then compared to assess their validity and feasibility of use in international survival comparisons. Results: It was feasible to generate the three comorbidity scores for each jurisdiction, which were found to have good content, face and concurrent validity. Predictive validity was limited and there was evidence that the reliability was questionable. Conclusion: The results presented here indicate that interjurisdictional comparability of recorded comorbidity was limited due to probable differences in coding and hospital admission practices in each area. Before the contribution of comorbidity to international differences in cancer survival can be investigated, an internationally harmonised comorbidity index is required.


2018 ◽  
Vol 82 (1) ◽  
pp. 49-54 ◽  
Author(s):  
Laurent Claret ◽  
Christina Pentafragka ◽  
Sanja Karovic ◽  
Binsheng Zhao ◽  
Lawrence H. Schwartz ◽  
...  
