DeepOS: pan-cancer prognosis estimation from RNA-sequencing data

Author(s):  
Marie PAVAGEAU ◽  
Louis REBAUD ◽  
Daphne MOREL ◽  
Stergios CHRISTODOULIDIS ◽  
Eric DEUTSCH ◽  
...  

RNA sequencing (RNAseq) analysis offers a tumor-centered approach of growing interest for personalizing cancer care. However, existing methods, including deep learning models, struggle to reach satisfactory performance on survival prediction from pan-cancer RNAseq data. Here, we present DeepOS, a novel deep learning model that predicts overall survival (OS) from pan-cancer RNAseq data with a concordance index of 0.715 and a survival AUC of 0.752 across 33 TCGA tumor types when evaluated on an unseen test cohort. DeepOS notably uses (i) prior biological knowledge to reduce input dimensionality, (ii) transfer learning to enlarge its training capacity through pretraining on organ prediction, and (iii) a mean squared error loss function adapted to survival data, all of which contributed to improving model performance. Interpretation showed that DeepOS learned biologically relevant prognostic biomarkers. Altogether, DeepOS achieved unprecedented and consistent performance on pan-cancer prognosis estimation from individual RNAseq data.
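The concordance index reported above can be illustrated with a minimal sketch of Harrell's C-index. This is not the DeepOS evaluation code: the function name `concordance_index` and its arguments are hypothetical, and the handling of ties (tied risk scores count as half, tied times are skipped) is one common convention assumed here.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times       -- observed follow-up times
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output, higher meaning worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # time corresponds to an observed death (assumed tie convention).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier death had the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Under these conventions a model that ranks every comparable pair correctly scores 1.0 and a constant predictor scores 0.5, which gives a scale for reading the 0.715 quoted in the abstract.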

Genes ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 240 ◽  
Author(s):  
Gangcai Xie ◽  
Chengliang Dong ◽  
Yinfei Kong ◽  
Jiang Zhong ◽  
Mingyao Li ◽  
...  

Accurate prognosis of patients with cancer is important for the stratification of patients, the optimization of treatment strategies, and the design of clinical trials. Both clinical features and molecular data can be used for this purpose, for instance, to predict the survival of patients censored at specific time points. Multi-omics data, including genome-wide gene expression, methylation, protein expression, copy number alteration, and somatic mutation data, are becoming increasingly common in cancer studies. To harness the rich information in multi-omics data, we developed GDP (Group lasso regularized Deep learning for cancer Prognosis), a computational tool for survival prediction using both clinical and multi-omics data. GDP integrated a deep learning framework with the Cox proportional hazards (CPH) model, and applied group lasso regularization to incorporate gene-level group prior knowledge into the model training process. We evaluated its performance in both simulated and real data from The Cancer Genome Atlas (TCGA) project. In simulated data, our results supported the importance of group prior information in the regularization of the model. Compared to the standard lasso regularization, we showed that group lasso achieved higher prediction accuracy when the group prior knowledge was provided. We also found that GDP performed better than CPH for complex survival data. Furthermore, analysis of real data demonstrated that GDP performed favorably against other methods in several cancers with large-scale omics data sets, such as glioblastoma multiforme, kidney renal clear cell carcinoma, and bladder urothelial carcinoma. In summary, we demonstrated that GDP is a powerful tool for prognosis of patients with cancer, especially when large-scale molecular features are available.
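To make the contrast between standard lasso and group lasso concrete, the sketch below computes both penalties for a weight vector. It is illustrative only and not taken from the GDP implementation: the function names, the `groups` mapping, and the gene labels are hypothetical, and scaling each group by the square root of its size is the textbook formulation, assumed here rather than confirmed by the abstract.

```python
import math


def lasso_penalty(weights, lam):
    """Standard lasso: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def group_lasso_penalty(weights, groups, lam):
    """Group lasso: lam times the sum over groups of sqrt(size) * L2 norm.

    groups maps a group label (for example a gene) to the indices of the
    features that belong to it (for example its expression, methylation
    and copy-number features). The sqrt(size) factor is an assumption.
    """
    total = 0.0
    for indices in groups.values():
        l2_norm = math.sqrt(sum(weights[i] ** 2 for i in indices))
        total += math.sqrt(len(indices)) * l2_norm
    return lam * total


# Hypothetical example: two genes with two features each.
weights = [3.0, 4.0, 0.0, 0.0]
groups = {"GENE_A": [0, 1], "GENE_B": [2, 3]}
print(lasso_penalty(weights, lam=1.0))
print(group_lasso_penalty(weights, groups, lam=1.0))
```

Because the L2 norm of a group is zero only when every weight in it is zero, the group penalty tends to switch whole groups on or off together, which is how group-level prior knowledge can enter training.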


2020 ◽  
Author(s):  
Yeping Lina Qiu ◽  
Hong Zheng ◽  
Arnout Devos ◽  
Olivier Gevaert

Abstract RNA sequencing has emerged as a promising approach in cancer prognosis as sequencing data becomes more easily and affordably accessible. However, it remains challenging to build good predictive models, especially when the sample size is limited and the number of features is high, which is a common situation in biomedical settings. To address these limitations, we propose a meta-learning framework based on neural networks for survival analysis and evaluate it in a genomic cancer research setting. We demonstrate that, compared to regular transfer-learning, meta-learning is a significantly more effective paradigm to leverage high-dimensional data that is relevant but not directly related to the problem of interest. Specifically, meta-learning explicitly constructs a model, from abundant data of relevant tasks, to learn a new task with few samples effectively. For the application of predicting cancer survival outcome, we also show that the meta-learning framework with a few samples is able to achieve competitive performance with learning from scratch with a significantly larger number of samples. Finally, we demonstrate that the meta-learning model implicitly prioritizes genes based on their contribution to survival prediction and allows us to identify important pathways in cancer.


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Yun-Ching Chen ◽  
Abhilash Suresh ◽  
Chingiz Underbayev ◽  
Clare Sun ◽  
Komudi Singh ◽  
...  

Abstract Background: In single-cell RNA-sequencing analysis, clustering cells into groups and differentiating cell groups by differentially expressed (DE) genes are 2 separate steps for investigating cell identity. However, the ability to differentiate between cell groups could be affected by clustering. This interdependency often creates a bottleneck in the analysis pipeline, requiring researchers to repeat these 2 steps multiple times by setting different clustering parameters to identify a set of cell groups that are more differentiated and biologically relevant. Findings: To accelerate this process, we have developed IKAP, an algorithm to identify major cell groups and improve differentiating cell groups by systematically tuning parameters for clustering. We demonstrate that, with default parameters, IKAP successfully identifies major cell types such as T cells, B cells, natural killer cells, and monocytes in 2 peripheral blood mononuclear cell datasets and recovers major cell types in a previously published mouse cortex dataset. These major cell groups identified by IKAP present more distinguishing DE genes compared with cell groups generated by different combinations of clustering parameters. We further show that cell subtypes can be identified by recursively applying IKAP within identified major cell types, thereby delineating cell identities in a multi-layered ontology. Conclusions: By tuning the clustering parameters to identify major cell groups, IKAP greatly improves the automation of single-cell RNA-sequencing analysis to produce distinguishing DE genes and refine cell ontology using single-cell RNA-sequencing data.


2021 ◽  
Vol 12 ◽  
Author(s):  
Haiya Bai ◽  
Youliang Wang ◽  
Huimin Liu ◽  
Junyang Lu

We aim to find a biomarker that can effectively predict the prognosis of patients with cutaneous melanoma (CM). The RNA sequencing data of CM were downloaded from The Cancer Genome Atlas (TCGA) database and randomly divided into a training group and a test group. Survival statistical analysis and machine-learning approaches were applied to the RNA sequencing data of CM to develop a prognostic signature. Using univariable Cox proportional hazards regression, the random survival forest algorithm, and receiver operating characteristic (ROC) analysis in the training group, a four-mRNA signature comprising CD276, UQCRFS1, HAPLN3, and PIP4P1 was identified. The four-mRNA signature could divide patients into low-risk and high-risk groups with different survival outcomes (log-rank p < 0.001). The predictive efficacy of the four-mRNA signature was confirmed in the test group, the whole TCGA group, and the independent GSE65904 dataset (log-rank p < 0.05). The independence of the four-mRNA signature in prognostic prediction was demonstrated by multivariate Cox analysis. ROC and timeROC analyses showed that the signature predicted survival better than other clinical variables such as melanoma Clark level and tumor stage. This study highlights that the four-mRNA model could be used as a prognostic signature for CM patients with potential clinical application value.


2019 ◽  
Author(s):  
William C. Wright ◽  
Taosheng Chen

Abstract Here we obtained RNA-sequencing data from the publicly available Pan-Cancer analysis project performed by The Cancer Genome Atlas (TCGA). Data within this project were processed with the same experimental workflow and analyzed downstream by the UCSC Toil recompute project. We reprocessed the resulting gene count files in batch to obtain normalized expression, a step that is critical for proper and comparable interpretation. We describe the linear modeling and normalization protocol, and provide an example of plotting the results using a gene of interest. We perform the entire protocol using freely available packages within the R framework.


2021 ◽  
Author(s):  
Benjamin Haibe-Kains ◽  
Michal Kazmierski ◽  
Mattea Welch ◽  
Sejin Kim ◽  
Chris McIntosh ◽  
...  

Abstract Accurate prognosis for an individual patient is a key component of precision oncology. Recent advances in machine learning have enabled the development of models using a wider range of data, including imaging. Radiomics aims to extract quantitative predictive and prognostic biomarkers from routine medical imaging, but evidence for computed tomography radiomics for prognosis remains inconclusive. We have conducted an institutional machine learning challenge to develop an accurate model for overall survival prediction in head and neck cancer using clinical data extracted from electronic medical records and pre-treatment radiological images, as well as to evaluate the true added benefit of radiomics for head and neck cancer prognosis. Using a large, retrospective dataset of 2,552 patients and a rigorous evaluation framework, we compared 12 different submissions using imaging and clinical data, separately or in combination. The winning approach used non-linear, multitask learning on clinical data and tumour volume, achieving high prognostic accuracy for 2-year and lifetime survival prediction and outperforming models relying on clinical data only, engineered radiomics and deep learning. Combining all submissions in an ensemble model resulted in improved accuracy, with the highest gain from an image-based deep learning model. Our results show the potential of machine learning and simple, informative prognostic factors in combination with large datasets as a tool to guide personalized cancer care.


2018 ◽  
Author(s):  
Axel Theorell ◽  
Yenan Troi Bryceson ◽  
Jakob Theorell

Abstract Technological advances have facilitated an exponential increase in the amount of information that can be derived from single cells, necessitating new computational tools that can make this highly complex data interpretable. Here, we introduce DEPECHE, a rapid, parameter-free, sparse k-means-based algorithm for clustering of multi- and megavariate single-cell data. In a number of computational benchmarks aimed at evaluating the capacity to form biologically relevant clusters, including flow/mass-cytometry and single-cell RNA sequencing data sets with manually curated gold standard solutions, DEPECHE clusters as well as or better than the best-performing state-of-the-art clustering algorithms. However, the main advantage of DEPECHE, compared to the state-of-the-art, is its unique ability to enhance interpretability of the formed clusters, in that it only retains variables relevant for cluster separation, thereby facilitating computationally efficient analyses as well as understanding of complex datasets. An open source R implementation of DEPECHE is available at https://github.com/theorell/DepecheR.

Author summary: DEPECHE, a data-mining algorithm for mega-variate data. Modern experimental technologies facilitate an array of single-cell measurements, e.g. at the RNA level, generating enormous datasets with thousands of annotated biological markers for each of thousands of cells. To analyze such datasets, researchers routinely apply automated or semi-automated techniques to order the cells into biologically relevant groups. However, even after such groups have been generated, it is often difficult to interpret the biological meaning of these groups since the definition of each group often depends on thousands of biological markers. Therefore, in this article, we introduce DEPECHE, an algorithm designed to simultaneously group cells and enhance interpretability of the formed groups. DEPECHE defines groups only with respect to biological markers that contribute significantly to differentiating the cells in the group from the rest of the cells, yielding more succinct group definitions. Using the open source R software DepecheR on RNA sequencing data and mass cytometry data, the number of defining markers was reduced up to 1000-fold, thereby increasing interpretability vastly, while maintaining or improving the biological relevance of the groups formed compared to state-of-the-art algorithms.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Luís A. Vale-Silva ◽  
Karl Rohr

Abstract The age of precision medicine demands powerful computational techniques to handle high-dimensional patient data. We present MultiSurv, a multimodal deep learning method for long-term pan-cancer survival prediction. MultiSurv uses dedicated submodels to establish feature representations of clinical, imaging, and different high-dimensional omics data modalities. A data fusion layer aggregates the multimodal representations, and a prediction submodel generates conditional survival probabilities for follow-up time intervals spanning several decades. MultiSurv is the first non-linear and non-proportional survival prediction method that leverages multimodal data. In addition, MultiSurv can handle missing data, including single values and complete data modalities. MultiSurv was applied to data from 33 different cancer types and yields accurate pan-cancer patient survival curves. A quantitative comparison with previous methods showed that MultiSurv achieves the best results according to different time-dependent metrics. We also generated visualizations of the learned multimodal representation of MultiSurv, which revealed insights on cancer characteristics and heterogeneity.
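The step from per-interval conditional survival probabilities to a patient survival curve can be sketched as a running product. This is the generic discrete-time identity and not MultiSurv's code: the function name `survival_curve` and the example probabilities are hypothetical.

```python
def survival_curve(conditional_probs):
    """Turn per-interval conditional survival probabilities into a curve.

    conditional_probs[k] is the probability of surviving interval k given
    survival up to its start. The value returned at position k is the
    probability of surviving through the end of interval k.
    """
    curve = []
    surviving = 1.0
    for p in conditional_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        surviving *= p          # S(t_k) = S(t_{k-1}) * p_k
        curve.append(surviving)
    return curve


# Hypothetical outputs for three consecutive follow-up intervals.
print(survival_curve([0.9, 0.8, 0.5]))
```

Since every factor lies between 0 and 1, the resulting curve can never increase, and no proportional-hazards assumption is needed to obtain it.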

