Large-scale benchmark study of survival prediction methods using multi-omics data

Author(s):  
Moritz Herrmann ◽  
Philipp Probst ◽  
Roman Hornung ◽  
Vindi Jurinovic ◽  
Anne-Laure Boulesteix

Abstract Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate for deriving such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups, especially clinical variables, from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: [email protected], +49 89 2180 3198 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. 
All analyses are reproducible using R code freely available on GitHub.
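
As a rough illustration of the concordance idea behind the C-index named in the abstract above, the following Python sketch may help. It implements Harrell's plain C-index, not Uno's inverse-probability-of-censoring-weighted variant that the study reports. The function name `harrell_c_index` and the toy data are hypothetical and are not taken from the authors' R code.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (simplified sketch).

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a comparable pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)  # 4 of 5 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Uno's version additionally reweights pairs to reduce the dependence on the censoring distribution.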

Author(s):  
Wei Wang ◽  
Wei Liu

Abstract Motivation Accurately predicting the risk of cancer patients is a central challenge for clinical cancer research. For high-dimensional gene expression data, Cox proportional hazard model with the least absolute shrinkage and selection operator for variable selection (Lasso-Cox) is one of the most popular feature selection and risk prediction algorithms. However, the Lasso-Cox model treats all genes equally, ignoring the biological characteristics of the genes themselves. This often encounters the problem of poor prognostic performance on independent datasets. Results Here, we propose a Reweighted Lasso-Cox (RLasso-Cox) model to ameliorate this problem by integrating gene interaction information. It is based on the hypothesis that topologically important genes in the gene interaction network tend to have stable expression changes. We used random walk to evaluate the topological weight of genes, and then highlighted topologically important genes to improve the generalization ability of the RLasso-Cox model. Experiments on datasets of three cancer types showed that the RLasso-Cox model improves the prognostic accuracy and robustness compared with the Lasso-Cox model and several existing network-based methods. More importantly, the RLasso-Cox model has the advantage of identifying small gene sets with high prognostic performance on independent datasets, which may play an important role in identifying robust survival biomarkers for various cancer types. Availability and implementation http://bioconductor.org/packages/devel/bioc/html/RLassoCox.html Supplementary information Supplementary data are available at Bioinformatics online.
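
To make the notion of a random-walk-based topological weight more concrete, here is a minimal Python sketch. The abstract does not specify the exact random-walk scheme, so the PageRank-style walk with uniform restart, the `restart` value, the function name `random_walk_weights`, and the four-gene toy network are all assumptions made for illustration. This is not the RLassoCox implementation.

```python
def random_walk_weights(adjacency, restart=0.15, iterations=100):
    """Approximate stationary visiting probabilities of a random walk with restart.

    adjacency -- dict mapping each gene to the list of genes it links to
                 (every listed neighbour must also be a key of the dict)
    restart   -- probability of jumping to a uniformly chosen gene (assumed value)
    """
    genes = sorted(adjacency)
    n = len(genes)
    weights = {g: 1.0 / n for g in genes}
    for _ in range(iterations):
        updated = {g: restart / n for g in genes}
        for g in genes:
            neighbours = adjacency[g]
            if not neighbours:
                # Isolated gene: spread its probability mass uniformly.
                for h in genes:
                    updated[h] += (1 - restart) * weights[g] / n
                continue
            share = (1 - restart) * weights[g] / len(neighbours)
            for h in neighbours:
                updated[h] += share
        weights = updated
    return weights


# Hypothetical star-shaped network: gene A interacts with B, C and D.
toy_network = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
toy_weights = random_walk_weights(toy_network)
```

In this toy network the hub gene A receives the largest weight, which matches the intuition that topologically important genes are the ones to be highlighted.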


Genes ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 240 ◽  
Author(s):  
Gangcai Xie ◽  
Chengliang Dong ◽  
Yinfei Kong ◽  
Jiang Zhong ◽  
Mingyao Li ◽  
...  

Accurate prognosis of patients with cancer is important for the stratification of patients, the optimization of treatment strategies, and the design of clinical trials. Both clinical features and molecular data can be used for this purpose, for instance, to predict the survival of patients censored at specific time points. Multi-omics data, including genome-wide gene expression, methylation, protein expression, copy number alteration, and somatic mutation data, are becoming increasingly common in cancer studies. To harness the rich information in multi-omics data, we developed GDP (Group lasso regularized Deep learning for cancer Prognosis), a computational tool for survival prediction using both clinical and multi-omics data. GDP integrated a deep learning framework with the Cox proportional hazards model (CPH), and applied group lasso regularization to incorporate gene-level group prior knowledge into the model training process. We evaluated its performance on both simulated and real data from The Cancer Genome Atlas (TCGA) project. In simulated data, our results supported the importance of group prior information in the regularization of the model. Compared to the standard lasso regularization, we showed that group lasso achieved higher prediction accuracy when the group prior knowledge was provided. We also found that GDP performed better than CPH for complex survival data. Furthermore, analysis of real data demonstrated that GDP performed favorably against other methods in several cancers with large-scale omics data sets, such as glioblastoma multiforme, kidney renal clear cell carcinoma, and bladder urothelial carcinoma. In summary, we demonstrated that GDP is a powerful tool for prognosis of patients with cancer, especially when large-scale molecular features are available.
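
The difference between the standard lasso and the group lasso penalty mentioned above can be sketched in a few lines of Python. This is only an illustration on a flat coefficient vector. GDP applies the penalty inside a neural network, and the square-root-of-group-size weighting used here is the common textbook convention, assumed rather than confirmed for GDP. The function names and toy values are hypothetical.

```python
import math


def lasso_penalty(coefs, lam):
    """Standard lasso: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def group_lasso_penalty(coefs, groups, lam):
    """Group lasso: lam times the sum over groups of sqrt(group size) * L2 norm.

    groups -- one group label per coefficient (e.g. the gene a feature belongs to)
    """
    by_group = {}
    for coef, label in zip(coefs, groups):
        by_group.setdefault(label, []).append(coef)
    return lam * sum(
        math.sqrt(len(values)) * math.sqrt(sum(c * c for c in values))
        for values in by_group.values()
    )


# Hypothetical example: two features per gene, only gene g1 is active.
toy_coefs = [3.0, 4.0, 0.0, 0.0]
toy_groups = ["g1", "g1", "g2", "g2"]
toy_l1 = lasso_penalty(toy_coefs, lam=0.5)
toy_group = group_lasso_penalty(toy_coefs, toy_groups, lam=0.5)
```

Because the L2 norm is taken per group, the penalty tends to switch whole groups on or off together, which is how gene-level prior knowledge can enter the training objective.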


2021 ◽  
Author(s):  
Khandakar Tanvir Ahmed ◽  
Jiao Sun ◽  
Jeongsik Yong ◽  
Wei Zhang

Accurate disease phenotype prediction plays an important role in the treatment of heterogeneous diseases like cancer in the era of precision medicine. With the advent of high-throughput technologies, more comprehensive multi-omics data are now available that can effectively link genotype to phenotype. However, the interactive relation of multi-omics datasets makes it particularly challenging to incorporate different biological layers to discover coherent biological signatures and predict phenotypic outcomes. In this study, we introduce omicsGAN, a generative adversarial network (GAN) model to integrate two omics datasets and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuses them to generate synthetic data with better predictive signals. Large-scale experiments on The Cancer Genome Atlas (TCGA) breast cancer and ovarian cancer datasets validate that (1) the model can effectively integrate two omics datasets (i.e., mRNA and microRNA expression data) and their interaction network (i.e., microRNA-mRNA interaction network), and the synthetic omics data generated by the proposed model perform better on cancer outcome classification and patient survival prediction than the original omics datasets; and (2) the integrity of the interaction network plays a vital role in the generation of synthetic data with higher predictive quality. Using a random interaction network does not allow the framework to learn meaningful information from the omics datasets and therefore results in synthetic data with weaker predictive signals.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jiarui Feng ◽  
Heming Zhang ◽  
Fuhai Li

Abstract Background Survival analysis is an important part of cancer studies. In addition to the existing Cox proportional hazards model, deep learning models have recently been proposed for survival prediction. These models directly integrate multi-omics data of a large number of genes using fully connected dense deep neural network layers, which are hard to interpret. On the other hand, cancer signaling pathways are important and interpretable concepts that define the signaling cascades regulating cancer development and drug resistance. Thus, it is important to investigate potential associations between patient survival and individual signaling pathways, which can help domain experts understand how deep learning models make specific predictions. Results In this exploratory study, we proposed to investigate the relevance and influence of a set of core cancer signaling pathways in the survival analysis of cancer patients. Specifically, we built a simplified and partially biologically meaningful deep neural network, DeepSigSurvNet, for survival prediction. The model integrates the gene expression and copy number data of 1967 genes from 46 major signaling pathways. We applied the model to four types of cancer and investigated the influence of the 46 signaling pathways in the cancers. Interestingly, the interpretable analysis identified distinct patterns of these signaling pathways, which are helpful in understanding the relevance of signaling pathways to the prediction of cancer patients’ survival time. These highly relevant signaling pathways, when combined with inhibitors of other essential signaling pathways, can be novel targets for drug and drug combination prediction to improve cancer patients’ survival time. Conclusion The proposed DeepSigSurvNet model can facilitate the understanding of the implications of signaling pathways on cancer patients’ survival by integrating multi-omics data and clinical factors.


2020 ◽  
Vol 36 (11) ◽  
pp. 3409-3417
Author(s):  
Lauren Spirko-Burns ◽  
Karthik Devarajan

Abstract Motivation One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback–Leibler information divergence and the Yang–Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R2 measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R2 index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements. Results We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods. 
Availability and implementation R code for the proposed methods is available at github.com/lburns27/Feature-Selection. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Di Wang ◽  
Zheng Jing ◽  
Kevin He ◽  
Lana X Garmire

Abstract Summary Cox-nnet is a neural-network-based prognosis prediction method, originally applied to genomics data. Here, we propose version 2 of Cox-nnet, with significant improvements in efficiency and interpretability, making it suitable for predicting prognosis from large-scale population data, including electronic medical records (EMR) datasets. We also add permutation-based feature importance scores and the direction of feature coefficients. When applied to a kidney transplantation dataset, Cox-nnet v2.0 reduces the training time of Cox-nnet by up to 32-fold (n = 10 000) and achieves better prediction accuracy than Cox-PH (P < 0.05). It also achieves similarly superior performance on the publicly available SUPPORT dataset (n = 8000). The high efficiency and accuracy make Cox-nnet v2.0 a desirable method for survival prediction in large-scale EMR data. Availability and implementation Cox-nnet v2.0 is freely available to the public at https://github.com/lanagarmire/Cox-nnet-v2.0. Supplementary information Supplementary data are available at Bioinformatics online.
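
The general idea of a permutation-based feature importance score can be sketched as follows. This is a generic, model-agnostic illustration in Python and not the Cox-nnet v2.0 code. The function names, the toy predictor and the negative mean squared error metric are assumptions. A survival model would more plausibly be scored with a concordance index or a partial likelihood.

```python
import random


def permutation_importance(predict, rows, targets, score, n_repeats=10, seed=0):
    """Mean drop in score when one feature column is shuffled.

    predict -- function mapping one feature row (list) to a prediction
    score   -- function(targets, predictions) where higher is better
    """
    rng = random.Random(seed)
    baseline = score(targets, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)  # break the link between this feature and the outcome
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - score(targets, [predict(r) for r in permuted]))
        importances.append(sum(drops) / n_repeats)
    return importances


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher values mean a better fit."""
    return -sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy model that uses only the first feature.
toy_rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0], [5.0, 7.0]]
toy_targets = [r[0] for r in toy_rows]
toy_importances = permutation_importance(lambda r: r[0], toy_rows, toy_targets, neg_mse)
```

A feature the model ignores gets an importance of zero, because shuffling it leaves every prediction unchanged.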


2021 ◽  
Author(s):  
Sara Althubaiti ◽  
Maxat Kulmanov ◽  
Yang Liu ◽  
Georgios V Gkoutos ◽  
Paul Schofield ◽  
...  

Abstract Combining multiple types of genomic, transcriptional, proteomic, and epigenetic datasets has the potential to reveal biological mechanisms across multiple scales, and may lead to more accurate models for clinical decision support. Developing efficient models that can derive clinical outcomes from high-dimensional data remains problematical; challenges include the integration of multiple types of omics data, inclusion of biological background knowledge, and developing machine learning models that are able to deal with this high dimensionality while having only few samples from which to derive a model. We developed DeepMOCCA, a framework for multi-omics cancer analysis. We combine different types of omics data using biological relations between genes, transcripts, and proteins, combine the multi-omics data with background knowledge in the form of protein–protein interaction networks, and use graph convolution neural networks to exploit this combination of multi-omics data and background knowledge. DeepMOCCA predicts survival time for individual patient samples for 33 cancer types and outperforms most existing survival prediction methods. Moreover, DeepMOCCA includes a graph attention mechanism which prioritizes driver genes and prognostic markers in a patient-specific manner; the attention mechanism can be used to identify drivers and prognostic markers within cohorts and individual patients. Author summary Linking the features of tumors to a prognosis for the patient is a critical part of managing cancer. Many methods have been applied to this problem but we still lack accurate prognostic markers for many cancers. We now have more information than ever before on the state of the cancer genome, the epigenetic changes in tumors, and gene expression at both RNA and protein levels. 
Here, we address the question of how this data can be used to predict cancer survival and discover which tumor genes make the greatest contribution to the prognosis in individual tumor samples. We have developed a computational model, DeepMOCCA, that uses artificial neural networks underpinned by a large graph constructed from background knowledge concerning the functional interactions between genes and their products. We show that with our method, DeepMOCCA can predict cancer survival time based entirely on features of the tumor at a cellular and molecular level. The method confirms many existing genes that affect survival but for some cancers suggests new genes, either not implicated in survival before or not known to be important in that particular cancer. The ability to predict the important features in individual tumors provided by our method raises the possibility of personalized therapy based on the gene or network dominating the prognosis for that patient.


2020 ◽  
Vol 34 (6) ◽  
pp. 795-805
Author(s):  
Maiken B Hansen ◽  
Lone Ross Nylandsted ◽  
Morten A Petersen ◽  
Mathilde Adsersen ◽  
Leslye Rojas-Concha ◽  
...  

Background: Large, nationally representative studies of the association between quality of life and survival time in cancer patients in specialized palliative care are missing. Aim: The aim of this study was to investigate whether symptoms/problems at admission to specialized palliative care were associated with survival and if the symptoms/problems may improve prediction of death within 1 week and 1 month, respectively. Setting/participants: All cancer patients who had filled in the EORTC QLQ-C15-PAL at admission to specialized palliative care in Denmark in 2010–2017 were included through the Danish Palliative Care Database. Cox regression was used to identify clinical variables (gender, age, type of contact (inpatient vs outpatient), and cancer site) and symptoms/problems significantly associated with survival. To test whether symptoms/problems improved survival predictions, the overall accuracy (area under the receiver operating characteristic curve) for different prediction models was compared. The validity of the prediction models was tested with data on 5,508 patients admitted to palliative care in 2018. Results: The study included 30,969 patients with an average age of 68.9 years; 50% were women. Gender, age, type of contact, cancer site, and most symptoms/problems were significantly associated with survival time. The predictive value of symptoms/problems was trivial except for physical function, which clearly improved the overall accuracy for 1-week and 1-month predictions of death when added to models including only clinical variables. Conclusion: Most symptoms/problems were significantly associated with survival and mainly physical function improved predictions of death. Interestingly, the predictive value of physical function was the same as all clinical variables combined (in hospice) or even higher (in palliative care teams).
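
The overall accuracy measure used above, the area under the receiver operating characteristic curve, can be read as the probability that a randomly chosen patient who died within the time window received a higher predicted risk than a randomly chosen patient who did not. The following Python sketch computes it by direct pairwise comparison. The function name `roc_auc` and the toy values are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels -- 1 if the event (e.g. death within 1 week) occurred, else 0
    scores -- predicted risk, higher means the event is judged more likely
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive case ranked above negative case
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two events, two non-events.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs are ranked correctly
```

Comparing this value between a model with clinical variables only and a model that also includes a symptom score is one way to quantify the added predictive value of that score.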

