Predicting lung adenocarcinoma disease progression using methylation-correlated blocks and ensemble machine learning classifiers

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10884
Author(s):  
Xin Yu ◽  
Qian Yang ◽  
Dong Wang ◽  
Zhaoyang Li ◽  
Nianhang Chen ◽  
...  

Applying the knowledge that methyltransferases and demethylases can modify adjacent cytosine-phosphate-guanine (CpG) sites in the same DNA strand, we found that combining multiple CpGs into a single block may improve cancer diagnosis. However, survival prediction remains a challenge. In this study, we developed a pipeline named “stacked ensemble of machine learning models for methylation-correlated blocks” (EnMCB) that combined Cox regression, support vector regression (SVR), and elastic-net models to construct signatures based on DNA methylation-correlated blocks for lung adenocarcinoma (LUAD) survival prediction. We used methylation profiles from The Cancer Genome Atlas (TCGA) as the training set, and profiles from the Gene Expression Omnibus (GEO) as validation and testing sets. First, we partitioned the genome into blocks of tightly co-methylated CpG sites, which we termed methylation-correlated blocks (MCBs). After partitioning and feature selection, we observed different diagnostic capacities for predicting patient survival across the models. We combined the multiple models into a single stacking ensemble model. The stacking ensemble model based on the top-ranked block had an area under the receiver operating characteristic curve of 0.622 in the TCGA training set, 0.773 in the validation set, and 0.698 in the testing set. When stratified by clinicopathological risk factors, the risk score predicted by the top-ranked MCB was an independent prognostic factor. Our results showed that our pipeline was a reliable tool that may facilitate MCB selection and survival prediction.
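
The partitioning step described in this abstract can be illustrated with a minimal sketch. It is not the EnMCB implementation: the function names, the toy beta values, the rule of comparing only neighbouring CpG sites, and the 0.8 Pearson correlation threshold are all assumptions made here for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def partition_blocks(beta, threshold=0.8):
    """Group position-ordered CpG sites into blocks of co-methylated neighbours.

    beta: list of rows, one per CpG site (ordered by genomic position),
          each row holding methylation beta values across samples.
    threshold: hypothetical cutoff; a site joins the current block when its
          correlation with the previous site is at least this value.
    Returns a list of blocks, each a list of row indices.
    """
    if not beta:
        return []
    blocks = [[0]]
    for i in range(1, len(beta)):
        if pearson(beta[i - 1], beta[i]) >= threshold:
            blocks[-1].append(i)   # extend the current block
        else:
            blocks.append([i])     # start a new block
    return blocks


# Toy data: 4 CpG sites x 4 samples (made-up values).
beta = [
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.22, 0.31, 0.43],
    [0.80, 0.20, 0.70, 0.10],
    [0.78, 0.25, 0.65, 0.15],
]
blocks = partition_blocks(beta)
print(blocks)
```

In this toy input the first two sites rise together across samples and the last two share a different pattern, so the sketch yields two blocks of two sites each.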

2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Rui-kun Zhang ◽  
Jia-lin Liu

Abstract Background Hepatocellular carcinoma (HCC) is one of the most common and invasive malignant tumors in the world. The change in DNA methylation is a key event in HCC. Methods Methylation datasets for HCC and 17 other types of cancer were downloaded from The Cancer Genome Atlas (TCGA). The CpG sites with large differences in methylation between tumor tissues and paracancerous tissues were identified. We used the HCC methylation dataset downloaded from the TCGA as the training set and removed the overlapping sites among all cancer datasets to ensure that only CpG sites specific to HCC remained. Logistic regression analysis was performed to select specific biomarkers that can be used to diagnose HCC, and two datasets—GSE157341 and GSE54503—downloaded from GEO as validation sets were used to validate our model. We also used a Cox regression model to select CpG sites related to patient prognosis. Results We identified 6 HCC-specific methylated CpG sites as biomarkers for HCC diagnosis. In the training set, the area under the receiver operating characteristic (ROC) curve (AUC) for the model containing all these sites was 0.971. The AUCs were 0.8802 and 0.9711 for the two validation sets from the GEO database. In addition, 3 other CpG sites were analyzed and used to create a risk scoring model for patient prognosis and survival prediction. Conclusions Through the analysis of HCC methylation datasets from the TCGA and Gene Expression Omnibus (GEO) databases, potential biomarkers for HCC diagnosis and prognosis evaluation were ascertained.
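
The AUC values quoted in these abstracts have a simple interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, not data from any of the studies.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    scores: predicted scores, higher meaning "more likely positive".
    labels: 1 for positive cases, 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: two positives, three negatives.
example_auc = auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0])
print(round(example_auc, 4))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.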


2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Qidong Cai ◽  
Boxue He ◽  
Pengfei Zhang ◽  
Zhenyu Zhao ◽  
Xiong Peng ◽  
...  

Abstract Background Alternative splicing (AS) plays critical roles in generating protein diversity and complexity. Dysregulation of AS underlies the initiation and progression of tumors. Machine learning approaches have emerged as efficient tools to identify promising biomarkers. It is meaningful to explore pivotal AS events (ASEs) via machine learning algorithms to deepen understanding and improve prognostic assessment of lung adenocarcinoma (LUAD). Method RNA sequencing data and AS data were extracted from The Cancer Genome Atlas (TCGA) database and the TCGA SpliceSeq database. Using several machine learning methods, we identified 24 pairs of LUAD-related ASEs implicated in splicing switches and a random forest-based classifier for identifying lymph node metastasis (LNM) consisting of 12 ASEs. Furthermore, we identified key prognosis-related ASEs and established a 16-ASE-based prognostic model to predict overall survival for LUAD patients using a Cox regression model, random survival forest analysis, and a forward selection model. Bioinformatics analyses were also applied to identify underlying mechanisms and associated upstream splicing factors (SFs). Results Each pair of ASEs was spliced from the same parent gene and exhibited perfect inverse intrapair correlation (correlation coefficient = −1). The 12-ASE-based classifier showed robust ability to evaluate the LNM status of LUAD patients, with an area under the receiver operating characteristic (ROC) curve (AUC) of more than 0.7 in fivefold cross-validation. The prognostic model performed well at 1, 3, 5, and 10 years in both the training cohort and internal test cohort. Univariate and multivariate Cox regression indicated the prognostic model could be used as an independent prognostic factor for patients with LUAD. Further analysis revealed correlations between the prognostic model and American Joint Committee on Cancer stage, T stage, N stage, and living status. The splicing network constructed from survival-related SFs and ASEs depicts the regulatory relationships between them. Conclusion In summary, our study provides insight into LUAD research and management based on these AS biomarkers.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Jie Liu ◽  
Shiqiang Hou ◽  
Jinyi Wang ◽  
Zhengjun Chai ◽  
Xuan Hong ◽  
...  

Background. Lung adenocarcinoma (LUAD), a major and often fatal subtype of lung cancer, causes many deaths and shows variable prognosis. This study aimed to assess key genes and to develop a prognostic signature for patients with LUAD. Method. RNA expression profiles and clinical data from 522 LUAD patients were downloaded from The Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) between normal tissues and LUAD samples were extracted and analyzed. Then, a 14-DEG signature for survival prediction in LUAD patients was developed by means of univariate and multivariate Cox regression analyses. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed to predict the potential biological functions and pathways of these DEGs. Results. Twenty-two of the 5924 DEGs in the TCGA dataset were associated with the overall survival (OS) of LUAD patients. Fourteen DEGs were finally selected by risk score analysis and included in our development and validation model. The ROC analysis indicated that the specificity and sensitivity of this signature were high. Further functional enrichment analyses indicated that these DEGs might regulate genes that affect the release of sequestered calcium ion into the cytosol and pathways associated with Vibrio cholerae infection. Conclusion. Our study developed a novel 14-DEG signature that provides more efficient and persuasive prognostic information beyond conventional clinicopathological factors for survival prediction in LUAD patients.


BMC Cancer ◽  
2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yang Zhai ◽  
Bin Zhao ◽  
Yuzhen Wang ◽  
Lina Li ◽  
Jingjin Li ◽  
...  

Abstract Background Lung adenocarcinoma (LUAD) is the most common pathological subtype of lung cancer. In recent years, immunotherapy, targeted therapy, and chemotherapy have shown some curative effect. However, treatment response and prognosis vary among LUAD patients, and the efficacy of existing LUAD risk prediction models is unsatisfactory. Methods The Cancer Genome Atlas (TCGA) LUAD dataset was downloaded. The differentially expressed immune genes (DEIGs) were analyzed with edgeR and DESeq2. The prognostic DEIGs were identified by Cox regression. A protein-protein interaction (PPI) network was inferred by STRING using prognostic DEIGs with p value < 0.05. The prognostic model based on DEIGs was established using Lasso regression. Immunohistochemistry was used to assess the expression of FERMT2, FKBP3, SMAD9, GATA2, and ITIH4 in 30 cases of LUAD tissues. Results In total, 1654 DEIGs were identified, of which 436 genes were prognostic. Gene functional enrichment analysis indicated that the DEIGs were involved in inflammatory pathways. We constructed 4 models using DEIGs. Model 4, which was constructed using the 436 DEIGs, performed best in prognostic prediction: the area under the receiver operating characteristic (ROC) curve was 0.824 for 3 years, 0.838 for 5 years, and 0.834 for 10 years. High levels of FERMT2 and FKBP3 and low levels of SMAD9, GATA2, and ITIH4 expression were related to poor overall survival in LUAD (p < 0.05). The prognostic model based on DEIGs reflected infiltration by immune cells. Conclusions In our study, we built an optimal prognostic signature for LUAD using DEIGs and verified the expression of selected genes in LUAD. Our results suggest that an immune signature can be harnessed to obtain prognostic insights.


Author(s):  
Qian Xu ◽  
Yurong Chen

Aging is an inevitable time-dependent process associated with a gradual decline in many physiological functions. Importantly, some studies have supported that aging may be involved in the development of lung adenocarcinoma (LUAD). However, no studies have described an aging-related gene (ARG)-based prognosis signature for LUAD. Accordingly, in this study, we analyzed ARG expression data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). After LASSO and Cox regression analyses, a six ARG-based signature (APOC3, EPOR, H2AFX, MXD1, PLCG2, and YWHAZ) was constructed using TCGA dataset that significantly stratified cases into high- and low-risk groups in terms of overall survival (OS). Cox regression analysis indicated that the ARG signature was an independent prognostic factor in LUAD. A nomogram based on the ARG signature and clinicopathological factors was developed in TCGA cohort and validated in the GEO dataset. Moreover, to visualize the prediction results, we established a web-based calculator yurong.shinyapps.io/ARGs_LUAD/. Calibration plots showed good consistency between the prediction of the nomogram and actual observations. Receiver operating characteristic curve and decision curve analyses indicated that the ARG nomogram had better OS prediction and clinical net benefit than the staging system. Taken together, these results established a genetic signature for LUAD based on ARGs, which may promote individualized treatment and provide promising novel molecular markers for immunotherapy.


2021 ◽  
Author(s):  
Huan Wang ◽  
Wei Wu ◽  
Chunxia Han ◽  
Jiaqi Zheng ◽  
Xinyu Cai ◽  
...  

BACKGROUND The absolute number of femoral neck fractures (FNFs) is increasing; however, the prediction of traumatic femoral head necrosis remains difficult. Machine learning algorithms have the potential to be superior to traditional prediction methods for the prediction of traumatic femoral head necrosis. OBJECTIVE The aim of this study is to use machine learning to construct a model for the analysis of risk factors and prediction of osteonecrosis of the femoral head (ONFH) in patients with FNF after internal fixation. METHODS We retrospectively collected preoperative, intraoperative, and postoperative clinical data of patients with FNF in 4 hospitals in Shanghai and followed up the patients for more than 2.5 years. A total of 259 patients with 43 variables were included in the study. The data were randomly divided into a training set (181/259, 69.8%) and a validation set (78/259, 30.1%). External data (n=376) were obtained from a retrospective cohort study of patients with FNF in 3 other hospitals. Least absolute shrinkage and selection operator regression and the support vector machine algorithm were used for variable selection. Logistic regression, random forest, support vector machine, and eXtreme Gradient Boosting (XGBoost) were used to develop the model on the training set. The validation set was used to tune the model hyperparameters to determine the final prediction model, and the external data were used to compare and evaluate the model performance. We compared the accuracy, discrimination, and calibration of the models to identify the best machine learning algorithm for predicting ONFH. Shapley additive explanations and local interpretable model-agnostic explanations were used to determine the interpretability of the black box model. RESULTS A total of 11 variables were selected for the models. The XGBoost model performed best on the validation set and external data. 
The accuracy, sensitivity, and area under the receiver operating characteristic curve of the model on the validation set were 0.987, 0.929, and 0.992, respectively. The accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve of the model on the external data were 0.907, 0.807, 0.935, and 0.933, respectively, and the log-loss was 0.279. The calibration curve demonstrated good agreement between the predicted probability and actual risk. The interpretability of the features and individual predictions were realized using the Shapley additive explanations and local interpretable model-agnostic explanations algorithms. In addition, the XGBoost model was translated into a self-made web-based risk calculator to estimate an individual’s probability of ONFH. CONCLUSIONS Machine learning performs well in predicting ONFH after internal fixation of FNF. The 6-variable XGBoost model predicted the risk of ONFH well and had good generalization ability on the external data, which can be used for the clinical prediction of ONFH after internal fixation of FNF.
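
For readers unfamiliar with the metrics reported here, the following sketch shows how accuracy, sensitivity, specificity, and log-loss are derived from true labels and predicted probabilities. The function names, the 0.5 decision threshold, and the toy data are assumptions for illustration; they are not taken from the study.

```python
from math import log


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity: fraction of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of true negatives that were correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


# Made-up labels and predicted probabilities.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.1]
metrics = classification_metrics(y_true, y_prob)
loss = log_loss(y_true, y_prob)
print(metrics, round(loss, 3))
```

Unlike the threshold-based metrics, log-loss uses the predicted probabilities directly, so it also penalizes confident wrong predictions.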


2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Xiaonan Zhao ◽  
Zhenzi Bai ◽  
Chenghua Li ◽  
Chuanlun Sheng ◽  
Hongyan Li

Studies have demonstrated the prognostic potential of long noncoding RNAs (lncRNAs) for hepatocellular carcinoma (HCC), but specific lncRNAs for hepatitis B virus- (HBV-) related HCC have rarely been reported. This study aimed to identify a lncRNA prognostic signature for HBV-HCC and to explore the underlying functions of these lncRNAs. The sequencing dataset was collected from The Cancer Genome Atlas database as the training set, while the microarray dataset was obtained from the European Bioinformatics Institute database (E-TABM-36) as the validation set. Univariate and multivariate Cox regression analyses identified eight lncRNAs (TSPEAR-AS1, LINC00511, LINC01136, MKLN1-AS, LINC00506, KRTAP5-AS1, ZNF252P-AS1, and THUMPD3-AS1) that were significantly associated with overall survival (OS). These eight lncRNAs were used to construct a risk score model. The Kaplan-Meier survival curve results showed that this risk score can significantly differentiate the OS between the high-risk group and the low-risk group. Receiver operating characteristic curve analysis demonstrated that this risk score exhibited good prediction effectiveness (area under the curve (AUC) = 0.990 for the training set; AUC = 0.903 for the validation set). Furthermore, this lncRNA risk score was identified as an independent prognostic factor in the multivariate analysis after adjusting for other clinical characteristics. The crucial coexpression (LINC00511-CABYR, THUMPD3-AS1-TRIP13, LINC01136-SFN, LINC00506-ANLN, and KRTAP5-AS1/TSPEAR-AS1/MKLN1-AS/ZNF252P-AS1-MC1R) or competing endogenous RNA (THUMPD3-AS1-hsa-miR-450a-TRIP13) interaction axes were identified to reveal the possible functions of lncRNAs. These genes were enriched in cell cycle-related biological processes or pathways. In conclusion, our study identified a novel eight-lncRNA prognosis signature for HBV-HCC patients, and these lncRNAs may be potential therapeutic targets.
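
Risk score models of the kind described in this and several neighbouring abstracts are commonly built as a weighted sum of expression values, with Cox regression coefficients as weights, followed by a split into high- and low-risk groups. The sketch below assumes that form and a median cutoff; the abstract does not state the cutoff rule, and the gene names, coefficients, and expression values are hypothetical placeholders.

```python
from statistics import median


def risk_scores(samples, coefs):
    """Linear risk score per sample: sum of coefficient * expression."""
    return [sum(coefs[g] * sample[g] for g in coefs) for sample in samples]


def split_by_median(scores):
    """Label each score 'high' if above the cohort median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


# Hypothetical coefficients (as if taken from a fitted Cox model).
coefs = {"geneA": 0.5, "geneB": -0.3}

# Hypothetical expression values for four patients.
samples = [
    {"geneA": 2.0, "geneB": 1.0},
    {"geneA": 1.0, "geneB": 2.0},
    {"geneA": 3.0, "geneB": 0.0},
    {"geneA": 0.0, "geneB": 1.0},
]

scores = risk_scores(samples, coefs)
groups = split_by_median(scores)
print(list(zip(scores, groups)))
```

A positive coefficient raises the score as expression increases (a risk-associated gene), while a negative coefficient lowers it (a protective gene); the resulting groups are what Kaplan-Meier curves then compare.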


2021 ◽  
Author(s):  
Xia Liu ◽  
Hangzhou Zhu ◽  
Xiaojiu Zha ◽  
Yan Rui ◽  
Miao Li ◽  
...  

Abstract Background: Malignant tumors are a main cause of death worldwide, and lung cancer is the main cause of death among them. The incidence and mortality of lung cancer are increasing year by year. This study aims to elucidate the potential prognostic value of keratin (KRT) gene family members in patients with lung adenocarcinoma (LUAD). Materials and methods: RNA sequencing data for LUAD tumors and paired normal tissues were obtained from The Cancer Genome Atlas (TCGA) database. Multivariate Cox proportional hazards regression analysis was used to evaluate the prognostic value of KRT family member genes. The screened variables were used to construct the risk score. A time-dependent ROC curve was used to evaluate the predicted results. Finally, nomograms were used to assess individualized prognostic risk. Result: From the differentially expressed genes, 14 KRT genes that were significantly differentially expressed between LUAD tumors and adjacent non-cancerous tissues were screened. Receiver operating characteristic (ROC) curve analysis confirmed that these 14 KRT genes can be used as potential diagnostic markers for lung adenocarcinoma. Multivariate Cox regression analysis showed that six KRT genes were related to the prognosis of lung cancer. The variables were screened by a multivariate Cox regression model. The final results showed that KRT8 and KRT6A were independent risk factors for the prognosis of lung adenocarcinoma. Conclusion: KRT8 and KRT6A can be used as prognostic markers of LUAD. High expression of KRT8 and KRT6A suggests a poor prognosis for LUAD patients.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Wenjie Chen ◽  
Wen Li ◽  
Zhenkun Liu ◽  
Guangzhi Ma ◽  
Yunfu Deng ◽  
...  

Abstract This study aimed to identify prognostic biomarkers in the competitive endogenous RNA (ceRNA) network and to explore the tumor-infiltrating immune cells (TIICs) that might be potential prognostic factors in lung adenocarcinoma. In addition, we tried to explain the crosstalk between the ceRNA network and TIICs to explore the molecular mechanisms involved in lung adenocarcinoma. The transcriptome data of lung adenocarcinoma were obtained from The Cancer Genome Atlas (TCGA) database, and the hypergeometric correlation of the differentially expressed miRNA-lncRNA and miRNA-mRNA pairs was analyzed based on starBase. In addition, Kaplan–Meier survival and Cox regression model analyses were used to identify the prognostic ceRNA network and TIICs. Correlation analysis was performed to analyze the correlation between the ceRNA network and TIICs. Among the differentially expressed RNAs between tumor and normal tissue, a total of 190 miRNAs, 224 lncRNAs, and 3024 mRNAs were detected, and the constructed ceRNA network contained 5 lncRNAs, 92 mRNAs, and 10 miRNAs. Then, six prognostic RNAs (FKBP3, GPI, LOXL2, IL22RA1, GPR37, and hsa-miR-148a-3p) were viewed as the key members for constructing the prognostic prediction model in the ceRNA network, and three kinds of TIICs (monocytes, M1 macrophages, and activated mast cells) were identified as significantly related to prognosis in lung adenocarcinoma. Correlation analysis suggested that FKBP3 was associated with monocytes and M1 macrophages, and GPI was clearly related to monocytes and M1 macrophages. In addition, LOXL2 was associated with monocytes and activated mast cells, IL22RA1 was significantly associated with monocytes and M1 macrophages, and GPR37 was closely related to M1 macrophages. The constructed ceRNA network and the identified monocytes, M1 macrophages, and activated mast cells are all prognostic factors for lung adenocarcinoma. Moreover, the crosstalk between the ceRNA network and TIICs might be a potential molecular mechanism involved.


2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii203-ii203
Author(s):  
Alexander Hulsbergen ◽  
Yu Tung Lo ◽  
Vasileios Kavouridis ◽  
John Phillips ◽  
Timothy Smith ◽  
...  

Abstract INTRODUCTION Survival prediction in brain metastases (BMs) remains challenging. Current prognostic models have been created and validated almost completely with data from patients receiving radiotherapy only, leaving uncertainty about surgical patients. Therefore, the aim of this study was to build and validate a model predicting 6-month survival after BM resection using different machine learning (ML) algorithms. METHODS An institutional database of 1062 patients who underwent resection for BM was split 80:20 into training and testing sets. Seven different ML algorithms were trained and assessed for performance. Moreover, an ensemble model was created incorporating random forest, adaptive boosting, gradient boosting, and logistic regression algorithms. Five-fold cross-validation was used for hyperparameter tuning. Model performance was assessed using area under the receiver operating characteristic curve (AUC) and calibration and was compared against the diagnosis-specific graded prognostic assessment (ds-GPA), the most established prognostic model in BMs. RESULTS The ensemble model showed superior performance with an AUC of 0.81 in the hold-out test set, a calibration slope of 1.14, and a calibration intercept of -0.08, outperforming the ds-GPA (AUC 0.68). Patients were stratified into high-, medium-, and low-risk groups for death at 6 months; these strata strongly predicted both 6-month and longitudinal overall survival (p < 0.001). CONCLUSIONS We developed and internally validated an ensemble ML model that accurately predicts 6-month survival after neurosurgical resection for BM, outperforms the most established model in the literature, and allows for meaningful risk stratification. Future efforts should focus on external validation of our model.
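
A hold-out split like the 80:20 division mentioned in this abstract can be sketched as a seeded shuffle followed by a cut. The function name, the seed, and the rounding rule are assumptions for illustration; the abstract does not report the exact group sizes or how the split was randomized.

```python
import random


def train_test_split(ids, test_fraction=0.2, seed=0):
    """Shuffle ids reproducibly and hold out a fraction as the test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # local RNG, so global state is untouched
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical patient identifiers 0..1061 for a cohort of 1062.
patient_ids = range(1062)
train_ids, test_ids = train_test_split(patient_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split reproducible, and keeping the test identifiers untouched until the final evaluation is what makes the hold-out AUC an honest estimate.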

