Two-Stage Approaches to Accounting for Patient Heterogeneity in Machine Learning Risk Prediction Models in Oncology

2021 ◽  
pp. 1015-1023
Author(s):  
Eun Jeong Oh ◽  
Ravi B. Parikh ◽  
Corey Chivers ◽  
Jinbo Chen

PURPOSE Machine learning models developed from electronic health records data have been increasingly used to predict risk of mortality for general oncology patients. But these models may have suboptimal performance because of patient heterogeneity. The objective of this work is to develop a new modeling approach to predicting short-term mortality that accounts for heterogeneity across multiple subgroups in the presence of a large number of electronic health record predictors. METHODS We proposed a two-stage approach to addressing heterogeneity among oncology patients of different cancer types for predicting their risk of mortality. Structured data were extracted from the University of Pennsylvania Health System for 20,723 patients of 11 cancer types, where 1,340 (6.5%) patients were deceased. We first modeled the overall risk for all patients without differentiating cancer types, as is done in the current practice. We then developed cancer type–specific models using the overall risk score as a predictor along with preselected type-specific predictors. The overall and type-specific models were compared with respect to discrimination using the area under the precision-recall curve (AUPRC) and calibration using the calibration slope. We also proposed metrics that characterize the degree of risk heterogeneity by comparing risk predictors in the overall and type-specific models. RESULTS The two-stage modeling resulted in improved calibration and discrimination across all 11 cancer types. The improvement in AUPRC was significant for hematologic malignancies including leukemia, lymphoma, and myeloma. For instance, the AUPRC increased from 0.358 to 0.519 (∆ = 0.161; 95% CI, 0.102 to 0.224) and from 0.299 to 0.354 (∆ = 0.055; 95% CI, 0.009 to 0.107) for leukemia and lymphoma, respectively. For all 11 cancer types, the two-stage approach generated well-calibrated risks. 
A high degree of heterogeneity between type-specific and overall risk predictors was observed for most cancer types. CONCLUSION Our two-stage modeling approach that accounts for cancer type–specific risk heterogeneity has better calibration and discrimination than a model agnostic to cancer types.
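The two-stage idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation: the plain gradient-descent logistic regression, the two made-up cancer types "A" and "B", the predictors `x` and `z`, and the use of in-sample log-loss in place of AUPRC and calibration slope are all assumptions made for brevity.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(w, X, y):
    """Mean negative log-likelihood of a logistic model with weights w."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        p = min(max(p, 1e-12), 1 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

def fit_logistic(X, y, w0, lr=0.1, steps=500):
    """Plain batch gradient descent on the log-loss, starting from w0."""
    w, n = list(w0), len(y)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

random.seed(0)
# Synthetic cohort (hypothetical): x is a shared predictor,
# z affects mortality risk only for cancer type "B".
TYPE_EFFECT = {"A": 0.0, "B": 1.5}
rows = []
for _ in range(600):
    ctype = random.choice(["A", "B"])
    x, z = random.gauss(0, 1), random.gauss(0, 1)
    p = sigmoid(-2.0 + 1.0 * x + TYPE_EFFECT[ctype] * z)
    rows.append((ctype, x, z, 1 if random.random() < p else 0))

# Stage 1: one overall model for all patients, ignoring cancer type.
X1 = [[1.0, x] for _, x, _, _ in rows]
y1 = [d for *_, d in rows]
w1 = fit_logistic(X1, y1, [0.0, 0.0])

# Stage 2: one model per type, using the stage-1 score (on the logit scale)
# plus a type-specific predictor.
results = {}
for ctype in TYPE_EFFECT:
    sub = [r for r in rows if r[0] == ctype]
    X2 = [[1.0, w1[0] + w1[1] * x, z] for _, x, z, _ in sub]
    y2 = [d for *_, d in sub]
    start = [0.0, 1.0, 0.0]           # these weights reproduce the overall model
    base = log_loss(start, X2, y2)    # overall model evaluated on this type
    w2 = fit_logistic(X2, y2, start)
    results[ctype] = (base, log_loss(w2, X2, y2))

for ctype, (base, refined) in results.items():
    print(f"type {ctype}: overall log-loss {base:.3f}, two-stage {refined:.3f}")
```

Because each stage-2 fit starts from weights that reproduce the overall model, gradient descent with a small step size can only match or lower the in-sample log-loss. A real comparison would need held-out data and the metrics named in the abstract.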

2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3072-3072
Author(s):  
Habte Aragaw Yimer ◽  
Wai Hong Wilson Tang ◽  
Mohan K. Tummala ◽  
Spencer Shao ◽  
Gina G. Chung ◽  
...  

3072 Background: The Circulating Cell-free Genome Atlas study (CCGA; NCT02889978) previously demonstrated that a blood-based multi-cancer early detection (MCED) test utilizing cell-free DNA (cfDNA) sequencing in combination with machine learning could detect cancer signals across multiple cancer types and predict cancer signal origin. Cancer classes were defined within the CCGA study for sensitivity reporting. Separately, cancer types defined by the American Joint Committee on Cancer (AJCC) criteria, which outline unique staging requirements and reflect a distinct combination of anatomic site, histology and other biologic features, were assigned to each cancer participant using the same source data for primary site of origin and histologic type. Here, we report CCGA ‘cancer class’ designation and AJCC ‘cancer type’ assignment within the third and final CCGA3 validation substudy to better characterize the diversity of tumors across which a cancer signal could be detected with the MCED test that is nearing clinical availability. Methods: CCGA is a prospective, multicenter, case-control, observational study with longitudinal follow-up (overall population N = 15,254). Plasma cfDNA from evaluable samples was analyzed using a targeted methylation bisulfite sequencing assay and a machine learning approach, and test performance, including sensitivity, was assessed. For sensitivity reporting, CCGA cancer classes were assigned to cancer participants using a combination of the type of primary cancer reported by the site and tumor characteristics abstracted from the site pathology reports by GRAIL pathologists. Each cancer participant also was separately assigned an AJCC cancer type based on the same source data using AJCC staging manual (8th edition) classifications. Results: A total of 4077 participants comprised the independent validation set with confirmed status (cancer: n = 2823; non-cancer: n = 1254 with non-cancer status confirmed at year-one follow-up). 
Sensitivity was reported for 24 cancer classes (sample sizes ranged from 10 to 524 participants), as well as an “other” cancer class (59 participants). According to AJCC classification, the MCED test was found to detect cancer signals across 50+ AJCC cancer types, including some types not present in the training set; some cancer types had limited representation. Conclusions: This MCED test that is nearing clinical availability and was evaluated in the third CCGA substudy detected cancer signals across 50+ AJCC cancer types. Reporting CCGA cancer classes and AJCC cancer types demonstrates the ability of the MCED test to detect cancer signals across a set of diverse cancer types representing a wide range of biologic characteristics, including cancer types that the classifier has not been trained on, and supports its use on a population-wide scale. Clinical trial information: NCT02889978.


Cancers ◽  
2021 ◽  
Vol 13 (15) ◽  
pp. 3768
Author(s):  
Vijayachitra Modhukur ◽  
Shakshi Sharma ◽  
Mainak Mondal ◽  
Ankita Lawarde ◽  
Keiu Kask ◽  
...  

Metastatic cancers account for up to 90% of cancer-related deaths. The clear differentiation of metastatic cancers from primary cancers is crucial for cancer type identification and developing targeted treatment for each cancer type. DNA methylation patterns are suggested to be an intriguing target for cancer prediction and are also considered to be an important mediator for the transition to metastatic cancer. In the present study, we used 24 cancer types and 9303 methylome samples downloaded from publicly available data repositories, including The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). We constructed machine learning classifiers to discriminate metastatic, primary, and non-cancerous methylome samples. We applied support vector machines (SVM), Naive Bayes (NB), extreme gradient boosting (XGBoost), and random forest (RF) machine learning models to classify the cancer types based on their tissue of origin. RF outperformed the other classifiers, with an average accuracy of 99%. Moreover, we applied local interpretable model-agnostic explanations (LIME) to explain important methylation biomarkers to classify cancer types.


10.2196/18387 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e18387
Author(s):  
Solbi Kweon ◽  
Jeong Hoon Lee ◽  
Younghee Lee ◽  
Yu Rang Park

Background As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and disclosure of personal information, are raised. Objective The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information. Methods RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance of the prediction models was measured by accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large sample sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. Results In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively.
Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types. Conclusions We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.


2018 ◽  
Vol 115 (6) ◽  
pp. 1322-1327 ◽  
Author(s):  
Byung-Ju Kim ◽  
Sung-Hou Kim

Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual’s susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental factors, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole-genome variation data. However, genome-wide association studies have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals’ inherited genomic susceptibility to acquire the most likely phenotype among a panel of 20 major common cancer types plus 1 “healthy” type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the phenotypes of 5,919 individuals of “white” ethnic population in this study, (i) the portion of the cohort of a cancer type who acquired the observed type due to mostly inherited genomic susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due to mostly environmental and lifestyle factors ranges from 12 to 67%), and (ii) on an individual level, the method also predicts individuals’ inherited genomic susceptibility to acquire the other types ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.


2021 ◽  
Author(s):  
Prottoy Saha ◽  
Rudra Das ◽  
Shanta Kumar Das

Abstract Brain cancer is quite possibly one of the leading causes of death in recent years. Appropriate diagnosis of the cancer type enables specialists to choose the right treatment and save the patient's life. A computer-aided diagnosis system with image processing that can classify tumor types correctly is therefore of clear importance. In this paper, an enhanced approach is proposed that can classify brain tumor types from magnetic resonance images (MRI) using deep learning and an ensemble of machine learning (ML) algorithms. The system, named BCM-VEMT, classifies among four classes: three categories of brain cancer (glioma, meningioma, and pituitary) and a non-cancerous (normal) class. A convolutional neural network is developed to extract deep features from the MRI images. These extracted deep features are then fed into multi-class ML classifiers to classify among these cancer types. Finally, a weighted average ensemble of classifiers is used to achieve better performance by combining the results of each ML classifier. The dataset has a total of 3787 MRI images across the four classes. BCM-VEMT achieved 97.90% accuracy for the glioma class, 98.94% for the meningioma class, 98.00% for the normal class, 98.92% for the pituitary class, and an overall accuracy of 98.42%. BCM-VEMT can be of great significance in classifying brain cancer types.
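The final ensemble step mentioned in this abstract can be illustrated with a small Python sketch. It assumes each classifier outputs a probability per class and that the ensemble is a weighted mean of those probabilities. The function name `weighted_average_ensemble`, the three probability vectors, and the weights are hypothetical and are not taken from the paper.

```python
def weighted_average_ensemble(prob_sets, weights):
    """Combine per-classifier class-probability vectors using normalized weights."""
    total = sum(weights)
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]

CLASSES = ["glioma", "meningioma", "normal", "pituitary"]

# Hypothetical outputs of three classifiers for a single MRI image.
probs = [
    [0.60, 0.30, 0.05, 0.05],
    [0.20, 0.50, 0.20, 0.10],
    [0.55, 0.25, 0.10, 0.10],
]
weights = [0.5, 0.2, 0.3]  # hypothetical, e.g. proportional to validation accuracy

combined = weighted_average_ensemble(probs, weights)
predicted = CLASSES[max(range(len(CLASSES)), key=combined.__getitem__)]
print(predicted, [round(p, 3) for p in combined])
```

Here the second classifier prefers meningioma, but the two more heavily weighted classifiers outvote it, so the ensemble predicts glioma.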


2022 ◽  
Author(s):  
Sy Hwang ◽  
Ryan Urbanowicz ◽  
Selah Lynch ◽  
Tawnya Vernon ◽  
Kellie Bresz ◽  
...  

Purpose: Predicting 30-day readmission risk is paramount to improving the quality of patient care. Previous studies have examined clinical risk factors associated with hospital readmissions. In this study, we compare sets of patient, provider, and community-level variables that are available at two different points of a patient's inpatient encounter (first 48 hours and the full encounter) to train readmission prediction models in order to identify and target appropriate actionable interventions that can potentially reduce avoidable readmissions. Methods: Using EHR data from a retrospective cohort of 2460 oncology patients, two sets of binary classification models predicting 30-day readmission were developed: one trained on variables that are available within the first 48 hours of admission and another trained on data from the entire hospital encounter. A comprehensive machine learning analysis pipeline was leveraged, including preprocessing and feature transformation, feature importance and selection, machine learning modeling, and post-analysis. Results: Leveraging all features, the LGB (light gradient boosted machines) model produced higher but comparable performance (AUC: 0.711 and APS: 0.225) compared to Epic (AUC: 0.697 and APS: 0.221). Given features in the first 48 hours, the random forest model produced higher AUC (0.684), but lower PRC (0.18) and APS (0.184) than the Epic model (AUC: 0.676). In terms of the characteristics of patients flagged by these models, both the full-feature (LGB) and 48-hour (random forest) models were highly sensitive, flagging more patients than the Epic models. Both models flagged patients with a similar distribution of race and sex; however, our LGB and random forest models were more inclusive, flagging more patients among younger age groups. The Epic models were more sensitive in identifying patients with a lower average zip income.
Our 48-hour models were powered by novel features at various levels: patient (weight change over 365 days, depression symptoms, laboratory values, cancer type), provider (winter discharge, hospital admission type), community (zip income, marital status of partner). Conclusion: We demonstrated that we could develop and validate models comparable to existing Epic 30-day readmission models while providing several actionable insights that could inform service interventions deployed by the case management or discharge planning teams that may decrease readmission rates over time.
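For readers unfamiliar with the two metrics reported in this abstract, the sketch below computes ROC AUC and average precision from scratch. It assumes "APS" denotes the average precision score, which the abstract does not define, and it uses four made-up predictions rather than any study data. The average-precision function assumes there are no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical readmission labels (1 = readmitted) and model risk scores.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

print(f"AUC = {roc_auc(y_true, scores):.3f}")
print(f"AP  = {average_precision(y_true, scores):.3f}")
```

With a rare outcome such as 30-day readmission, average precision is typically much lower than AUC, which is consistent with the gap between the two metrics in the reported results.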


Author(s):  
Chih-Hsiang Yang ◽  
Jaclyn P Maher ◽  
Aditya Ponnada ◽  
Eldin Dzubur ◽  
Rachel Nordgren ◽  
...  

Abstract People differ from each other in the extent to which momentary factors, such as context, mood, and cognitions, influence momentary health behaviors. However, statistical models to date are limited in their ability to test whether the association between two momentary variables (i.e., subject-level slopes) predicts a subject-level outcome. This study demonstrates a novel two-stage statistical modeling strategy that is capable of testing whether subject-level slopes between two momentary variables predict subject-level outcomes. An empirical case study application is presented to examine whether there are differences in momentary moderate-to-vigorous physical activity (MVPA) levels between the outdoor and indoor context in adults and whether these momentary differences predict mean daily MVPA levels 6 months later. One hundred and eight adults from a multiwave longitudinal study provided 4 days of ecological momentary assessment (during baseline) and accelerometry data (both at baseline and 6-month follow-up). Multilevel data were analyzed using an open-source program (MixWILD) to test whether the strength of the momentary association between outdoor context and MVPA during baseline was associated with average daily MVPA levels measured 6 months later. During baseline, momentary MVPA levels were higher in outdoor contexts as compared to indoor contexts (b = 0.07, p < .001). Participants who had more momentary MVPA when outdoors (vs. indoors) during baseline (i.e., a greater subject-level slope) had higher daily MVPA at the 6-month follow-up (b = 0.09, p < .05). This empirical example shows that the subject-level momentary association between specific context (i.e., outdoors) and health behavior (i.e., physical activity) may contribute to overall engagement in that behavior in the future. The demonstrated two-stage modeling approach has extensive applications in behavioral medicine to analyze intensive longitudinal data collected from wearable sensors and mobile devices.
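The logic of using subject-level slopes as predictors can be shown with a deliberately simplified Python sketch. It is not the MixWILD procedure: it swaps in a separate ordinary least-squares fit per subject, followed by a second least-squares regression. The simulated subjects, effect sizes, and noise levels are all hypothetical.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

random.seed(1)
stage1_slopes, followup = [], []
for _ in range(50):                        # 50 hypothetical subjects
    true_slope = random.gauss(0.07, 0.05)  # this subject's outdoor-vs-indoor effect
    outdoor = [i % 2 for i in range(20)]   # 20 prompts, alternating context
    mvpa = [1.0 + true_slope * o + random.gauss(0, 0.05) for o in outdoor]
    # Stage 1: estimate the subject's own slope from the momentary data.
    stage1_slopes.append(ols_slope(outdoor, mvpa))
    # Subject-level outcome, simulated to depend on the true slope.
    followup.append(2.0 + 3.0 * true_slope + random.gauss(0, 0.05))

# Stage 2: regress the subject-level outcome on the stage-1 slopes.
stage2_coef = ols_slope(stage1_slopes, followup)
print(f"stage-2 coefficient: {stage2_coef:.2f}")
```

Because the stage-1 slopes are noisy estimates, the stage-2 coefficient in a naive version like this is biased toward zero. That bias is one reason to prefer a joint mixed-effects formulation over separate per-subject fits.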


2021 ◽  
Author(s):  
Thi Minh Kha Nguyen ◽  
Astrid Behnert ◽  
Torsten Pietsch ◽  
Christian Vokuhl ◽  
Christian Peter Kratz

Abstract In children with cancer, specific clinical features such as physical anomalies, occurrence of cancer in young relatives, specific cancer histologies, and unique mutation/methylation signatures may indicate the presence of an underlying cancer predisposition syndrome (CPS). The proportion of children with a cancer type suggesting a CPS among all children with cancer is unknown. The objective of this study was to determine the proportion of children with cancer types suggesting an underlying CPS among children with cancer. We evaluated the number of children with cancer types strongly associated with CPS diagnosed in Germany between 2007 and 2016. Data were obtained from various sources including two national pediatric pathology reference laboratories for brain and solid tumors, respectively, various childhood cancer trial offices as well as the German Childhood Cancer Registry. Among 21,127 children diagnosed with cancer between 2007 and 2016, 2554 (12.1%) had a cancer type strongly associated with a CPS. The most common diagnoses were myelodysplastic syndrome and juvenile myelomonocytic leukemia, retinoblastoma, malignant peripheral nerve sheath tumor, infantile myofibromatosis, medulloblastoma SHH, rhabdoid tumor as well as atypical teratoid/rhabdoid tumor. Based on cancer type only, 12.1% of all children with cancer have an indication for a genetic evaluation. Pediatric oncology patients require access to genetic counselling and testing.


SLEEP ◽  
2021 ◽  
Vol 44 (Supplement_2) ◽  
pp. A166-A166
Author(s):  
Ankita Paul ◽  
Karen Wong ◽  
Anup Das ◽  
Diane Lim ◽  
Miranda Tan

Abstract Introduction Cancer patients are at an increased risk of moderate-to-severe obstructive sleep apnea (OSA). The STOP-Bang score is a commonly used screening questionnaire to assess risk of OSA in the general population. We hypothesize that cancer-relevant features, like radiation therapy (RT), may be used to determine the risk of OSA in cancer patients. Machine learning (ML) with non-parametric regression is applied to increase the prediction accuracy of OSA risk. Methods Ten features, namely STOP-Bang score, history of RT to the head/neck/thorax, cancer type, cancer stage, metastasis, hypertension, diabetes, asthma, COPD, and chronic kidney disease, were extracted from a database of cancer patients with a sleep study. The ML technique, K-Nearest-Neighbor (KNN), with a range of k values (5 to 20), was chosen because, unlike Logistic Regression (LR), KNN is not presumptive of data distribution and mapping function, and supports non-linear relationships among features. A correlation heatmap was computed to identify features having high correlation with OSA. Principal Component Analysis (PCA) was performed on the correlated features and then KNN was applied on the components to predict the risk of OSA. Receiver Operating Characteristic (ROC) - Area Under Curve (AUC) and Precision-Recall curves were computed to compare and validate performance for different test sets and majority class scenarios. Results In our cohort of 174 cancer patients, the accuracy in determining OSA among cancer patients using STOP-Bang score was 82.3% (LR) and 90.69% (KNN) but reduced to 89.9% in KNN using all 10 features mentioned above. PCA + KNN application using STOP-Bang score and RT as features increased prediction accuracy to 94.1%. We validated our ML approach using a separate cohort of 20 cancer patients; the accuracies in OSA prediction were 85.57% (LR), 91.1% (KNN), and 92.8% (PCA + KNN).
Conclusion STOP-Bang score and history of RT can be useful to predict risk of OSA in cancer patients with the PCA + KNN approach. This ML technique can refine screening tools to improve prediction accuracy of OSA in cancer patients. Larger studies investigating additional features using ML may improve OSA screening accuracy in various populations.
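The K-Nearest-Neighbor step named in this abstract can be sketched as a simple majority vote. The example below covers KNN only and omits the PCA step. The eight training points pairing a STOP-Bang score with a prior-RT flag are invented for illustration, and `k=3` is used only because the toy set is too small for the k values of 5 to 20 reported in the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (STOP-Bang score, prior RT flag) pairs with OSA labels (1 = OSA).
train_X = [(2, 0), (3, 0), (1, 0), (2, 1), (6, 1), (7, 1), (5, 1), (6, 0)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

print(knn_predict(train_X, train_y, (6, 1), k=3))  # high score, prior RT
print(knn_predict(train_X, train_y, (2, 0), k=3))  # low score, no RT
```

In practice the features would need to be put on a common scale before computing distances, since a 0 to 8 questionnaire score would otherwise dominate a binary flag.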

