Weighted Cox regression for the prediction of heterogeneous patient subgroups

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Katrin Madjar ◽  
Jörg Rahnenführer

Abstract Background An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to their relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results and subgroup-specific effects may remain undetected. Methods For this situation, we propose a penalized Cox regression model with a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. The weights are estimated from the data such that patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model. Results Our proposed approach is evaluated through simulations and application to real lung cancer cohorts, and compared to existing approaches. Simulation results demonstrate that our proposed model is superior to standard approaches in terms of prediction performance and variable selection accuracy when the sample size is small. Conclusions The results suggest that sharing information between subgroups by incorporating appropriate weights into the likelihood can increase power to identify the prognostic covariates and improve risk prediction.
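As a rough illustration of the weighted partial likelihood mentioned in the Methods, the sketch below evaluates one common form of a weighted Cox log partial likelihood, in which each patient's own contribution and each risk-set term are multiplied by that patient's weight. The function name, the argument layout, and the absence of any tie correction are assumptions made for this example. The abstract does not give the exact likelihood, the weight estimation procedure, or the penalty term, so none of those are reproduced here.

```python
import math


def weighted_cox_log_partial_likelihood(beta, X, times, events, weights):
    """Weighted Cox log partial likelihood (illustrative form, no tie correction).

    beta    : list of regression coefficients
    X       : list of covariate rows, one per patient
    times   : observed follow-up times
    events  : 1 if the event was observed, 0 if censored
    weights : non-negative per-patient weights (hypothetical inputs here;
              how they are estimated is not covered by this sketch)
    """
    # Linear predictor x_i' beta for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    n = len(times)
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored patients only appear inside risk sets
        # Weighted sum over everyone still at risk at time t_i.
        risk = sum(weights[j] * math.exp(eta[j])
                   for j in range(n) if times[j] >= times[i])
        total += weights[i] * (eta[i] - math.log(risk))
    return total


# Toy data: three patients, one covariate, last patient censored.
X = [[1.0], [2.0], [3.0]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
unweighted = weighted_cox_log_partial_likelihood([0.0], X, times, events, [1.0, 1.0, 1.0])
downweighted = weighted_cox_log_partial_likelihood([0.0], X, times, events, [1.0, 0.5, 0.5])
```

With all weights equal to 1 the function reduces to the ordinary (unweighted) log partial likelihood, which is a convenient sanity check before plugging in subgroup-specific weights.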

Cancers ◽  
2021 ◽  
Vol 13 (14) ◽  
pp. 3533
Author(s):  
Paul Lacaze ◽  
Andrew Bakshi ◽  
Moeen Riaz ◽  
Suzanne G. Orchard ◽  
Jane Tiller ◽  
...  

Genomic risk prediction models for breast cancer (BC) have been predominantly developed with data from women aged 40–69 years. Prospective studies of older women aged ≥70 years have been limited. We assessed the effect of a 313-variant polygenic risk score (PRS) for BC in 6339 older women aged ≥70 years (mean age 75 years) enrolled into the ASPREE trial, a randomized double-blind placebo-controlled clinical trial investigating the effect of daily 100 mg aspirin on disability-free survival. We evaluated incident BC diagnoses over a median follow-up time of 4.7 years. A multivariable Cox regression model including conventional BC risk factors was applied to prospective data, and re-evaluated after adding the PRS. We also assessed the association of rare pathogenic variants (PVs) in BC susceptibility genes (BRCA1/BRCA2/PALB2/CHEK2/ATM). The PRS, as a continuous variable, was an independent predictor of incident BC (hazard ratio (HR) per standard deviation (SD) = 1.4, 95% confidence interval (CI) 1.3–1.6) and hormone receptor (ER/PR)-positive disease (HR = 1.5 (CI 1.2–1.9)). Women in the top quintile of the PRS distribution had over two-fold higher risk of BC than women in the lowest quintile (HR = 2.2 (CI 1.2–3.9)). The concordance index of the model without the PRS was 0.62 (95% CI 0.56–0.68), which improved after addition of the PRS to 0.65 (95% CI 0.59–0.71). Among 41 (0.6%) carriers of PVs in BC susceptibility genes, we observed no incident BC diagnoses. Our study demonstrates that a PRS predicts incident BC risk in women aged 70 years and older, suggesting potential clinical utility extends to this older age group.
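The concordance index reported above can be illustrated with a simplified version of Harrell's C for right-censored data. This is a sketch under stated assumptions: the function name is hypothetical, pairs with tied follow-up times are simply skipped, and it is not necessarily the estimator the study used.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and
    patient j was followed for strictly longer. The pair is concordant when
    the patient with the earlier event has the higher predicted risk.
    Ties in predicted risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up for illustration): third patient is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = harrell_c_index(times, events, [4, 3, 2, 1])
c_mixed = harrell_c_index(times, events, [4, 1, 2, 3])
c_constant = harrell_c_index(times, events, [1, 1, 1, 1])
```

A value of 0.5 corresponds to uninformative risk scores and 1.0 to perfect ranking, which puts the reported change from 0.62 to 0.65 in context.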


Processes ◽  
2021 ◽  
Vol 9 (10) ◽  
pp. 1804
Author(s):  
John Ndisya ◽  
Ayub Gitau ◽  
Duncan Mbuge ◽  
Arman Arefi ◽  
Liliana Bădulescu ◽  
...  

In this study, hyperspectral imaging (HSI) and chemometrics were implemented to develop prediction models for moisture, colour, chemical and structural attributes of purple-speckled cocoyam slices subjected to hot-air drying. Since HSI systems are costly and computationally demanding, the selection of a narrow band of wavelengths can enable the utilisation of simpler multispectral systems. In this study, 19 optimal wavelengths in the spectral range 400–1700 nm were selected using PLS-BETA and PLS-VIP feature selection methods. Prediction models for the studied quality attributes were developed from the 19 wavelengths. Excellent prediction performance (RMSEP < 2.0, r2P > 0.90, RPDP > 3.5) was obtained for MC, RR, VS and aw. Good prediction performance (RMSEP < 8.0, r2P = 0.70–0.90, RPDP > 2.0) was obtained for PC, BI, CIELAB b*, chroma, TFC, TAA and hue angle. Additionally, PPA and WI were also predicted successfully. An assessment of the agreement between predictions from the non-invasive hyperspectral imaging technique and experimental results from the routine laboratory methods established the potential of the HSI technique to replace or be used interchangeably with laboratory measurements. Additionally, a comparison of full-spectrum model results and the reduced models demonstrated the potential replacement of HSI with simpler imaging systems.
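For readers unfamiliar with the performance measures quoted above, the sketch below computes the root mean square error of prediction (RMSEP) and the ratio of performance to deviation (RPD), taken here as the standard deviation of the reference values divided by RMSEP. The function names and the use of the sample standard deviation are assumptions for this example; the abstract does not state which convention the authors followed.

```python
import math
import statistics


def rmsep(reference, predicted):
    """Root mean square error of prediction over paired values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(reference, predicted):
    """Ratio of performance to deviation: SD(reference) / RMSEP.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumption of this sketch.
    """
    return statistics.stdev(reference) / rmsep(reference, predicted)


# Made-up laboratory values and model predictions, for illustration only.
lab_values = [1.0, 2.0, 3.0, 4.0, 5.0]
model_values = [1.5, 2.5, 3.5, 4.5, 5.5]
example_rmsep = rmsep(lab_values, model_values)
example_rpd = rpd(lab_values, model_values)
```

Because RPD scales the error by the spread of the reference data, the same RMSEP can correspond to very different RPD values for attributes with narrow or wide ranges.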


2020 ◽  
Vol 4 (1) ◽  
Author(s):  
Alexander Pate ◽  
Richard Emsley ◽  
Matthew Sperrin ◽  
Glen P. Martin ◽  
Tjeerd van Staa

Abstract Background Stability of risk estimates from prediction models may be highly dependent on the sample size of the dataset available for model derivation. In this paper, we evaluate the stability of cardiovascular disease risk scores for individual patients when using different sample sizes for model derivation; such sample sizes include those similar to models recommended in the national guidelines, and those based on recently published sample size formula for prediction models. Methods We mimicked the process of sampling N patients from a population to develop a risk prediction model by sampling patients from the Clinical Practice Research Datalink. A cardiovascular disease risk prediction model was developed on this sample and used to generate risk scores for an independent cohort of patients. This process was repeated 1000 times, giving a distribution of risks for each patient. N = 100,000, 50,000, 10,000, Nmin (derived from sample size formula) and Nepv10 (meets 10 events per predictor rule) were considered. The 5–95th percentile range of risks across these models was used to evaluate instability. Patients were grouped by a risk derived from a model developed on the entire population (population-derived risk) to summarise results. Results For a sample size of 100,000, the median 5–95th percentile range of risks for patients across the 1000 models was 0.77%, 1.60%, 2.42% and 3.22% for patients with population-derived risks of 4–5%, 9–10%, 14–15% and 19–20% respectively; for N = 10,000, it was 2.49%, 5.23%, 7.92% and 10.59%, and for N using the formula-derived sample size, it was 6.79%, 14.41%, 21.89% and 29.21%. Restricting this analysis to models with high discrimination, good calibration or small mean absolute prediction error reduced the percentile range, but high levels of instability remained. Conclusions Widely used cardiovascular disease risk prediction models suffer from high levels of instability induced by sampling variation. 
Many models will also suffer from overfitting (a closely linked concept), but at acceptable levels of overfitting, there may still be high levels of instability in individual risk. Stability of risk estimates should be a criterion when determining the minimum sample size to develop models.
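The instability measure described in the Methods can be sketched as follows: for each patient, take the risks predicted by the repeatedly re-derived models, compute the width of the 5th to 95th percentile range, and summarise with the median across patients. The function names and the linear-interpolation percentile rule are assumptions of this example; the abstract does not say which percentile definition was used.

```python
import statistics


def percentile(sorted_values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)  # floor, since pos is non-negative
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def percentile_range(risks, lower=5.0, upper=95.0):
    """Width of the [lower, upper] percentile range of one patient's risks."""
    ordered = sorted(risks)
    return percentile(ordered, upper) - percentile(ordered, lower)


def median_percentile_range(risks_by_patient, lower=5.0, upper=95.0):
    """Median of the per-patient percentile ranges."""
    return statistics.median(
        percentile_range(risks, lower, upper) for risks in risks_by_patient
    )


# Made-up risks (in percent) standing in for predictions from repeated models.
one_patient = list(range(101))
width = percentile_range(one_patient)
summary = median_percentile_range([[0, 10], [0, 20], [0, 40]])
```

Passing `lower=2.5` and `upper=97.5` gives the wider range used in some analyses of the same kind.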


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yuichi Okinaga ◽  
Daisuke Kyogoku ◽  
Satoshi Kondo ◽  
Atsushi J. Nagano ◽  
Kei Hirose

Abstract The least absolute shrinkage and selection operator (lasso) and principal component regression (PCR) are popular methods of estimating traits from high-dimensional omics data, such as transcriptomes. The prediction accuracy of these estimation methods is highly dependent on the covariance structure, which is characterized by gene regulation networks. However, the manner in which the structure of a gene regulation network together with the sample size affects prediction accuracy has not yet been sufficiently investigated. In this study, Monte Carlo simulations are conducted to investigate the prediction accuracy for several network structures under various sample sizes. When the gene regulation network is a random graph, a sufficiently large number of observations are required to ensure good prediction accuracy with the lasso. The PCR provided poor prediction accuracy regardless of the sample size. However, a real gene regulation network is likely to exhibit a scale-free structure. In such cases, the simulation indicates that a relatively small number of observations, such as N = 300, is sufficient to allow the accurate prediction of traits from a transcriptome with the lasso.


2020 ◽  
Author(s):  
Alexander Pate ◽  
Richard Emsley ◽  
Matthew Sperrin ◽  
Glen P. Martin ◽  
Tjeerd van Staa

Abstract Background Stability of risk estimates from prediction models may be highly dependent on the sample size of the dataset available for model derivation. In this paper, we evaluate the stability of cardiovascular disease risk scores for individual patients when using different sample sizes for model derivation; such sample sizes include those similar to models recommended in national guidelines, and those based on recently published sample size formula for prediction models. Methods We mimicked the process of sampling N patients from a population to develop a risk prediction model by sampling patients from the Clinical Practice Research Datalink. A cardiovascular disease risk prediction model was developed on this sample and used to generate risk scores for an independent cohort of patients. This process was repeated 1000 times, giving a distribution of risks for each patient. N = 100 000, 50 000, 10 000 and N min (derived from sample size formula) were considered. The 2.5 – 97.5 percentile range of risks across these models was used to evaluate instability. Patients were grouped by a risk derived from a model developed on the entire population (population derived risk) to summarise results. Results For a sample size of 10 000, the median 2.5 – 97.5 percentile range of risks for patients across the 1000 models was approximately 60% of their population derived risk. For example, for patients with a population derived risk of 9 - 10% or 19 - 20%, the median percentile range was 6.25% and 12.59% respectively. Using the formula derived sample size, the range was approximately 170% of their average risk score. Restricting this analysis to models with high discrimination or good calibration reduced the percentile range, but high levels of instability remained. Conclusions Widely used cardiovascular disease risk prediction models suffer from high levels of instability induced by sampling variation. 
Stability of risk estimates should be a criterion when determining the minimum sample size to develop models.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1244
Author(s):  
Lin Hao ◽  
Juncheol Kim ◽  
Sookhee Kwon ◽  
Il Do Ha

With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated. Therefore, effectively analyzing such data has become a significant challenge. Machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions in a variety of practical fields such as high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN prediction survival model (DNNSurv model), which was built with Keras and TensorFlow, was developed. However, its results were only evaluated on the survival datasets with high-dimensional or large sample sizes. In this paper, we evaluated the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compared it with three popular ML survival prediction models (i.e., random survival forest and the Cox-based LASSO and Ridge models). For this purpose, we also present the optimal setting of several hyperparameters, including the selection of a tuning parameter. The proposed method demonstrated via data analysis that the DNNSurv model performed well overall as compared with the ML models, in terms of the three main evaluation measures (i.e., concordance index, time-dependent Brier score, and the time-dependent AUC) for survival prediction performance.


2021 ◽  
Vol 5 (Supplement_1) ◽  
pp. A417-A418
Author(s):  
Amanda Yun Rui Lam ◽  
Min Min Chan ◽  
David Carmody ◽  
Ming Ming Teh ◽  
Yong Mong Bee ◽  
...  

Abstract Background: South-East Asia has seen a dramatic increase in type 2 diabetes (T2D). Risk prediction models for Major adverse cardiovascular events (MACE) identify patients who may benefit most from intensive prevention strategies. Existing risk prediction models for T2D were developed mainly in Caucasian populations, limiting their generalizability to Asian populations. We developed a Lasso-Cox regression model to predict the 5-year risk of incident MACE in Asian patients with T2DM using data from the largest diabetes registry in Singapore. Methodology: The diabetes registry contained public healthcare data from 9 primary healthcare centers, 4 hospitals and 3 national specialty centers. Data from 120,131 T2D subjects without MACE at baseline, from 2008 to 2018, were used for model development and validation. Patients with less than 5 years of follow-up data were excluded. Lasso-Cox, a semi-parametric variant of the Cox Proportional Hazard Model with l1-regularization, was used to predict individual survival distribution of incident MACE. A total of 69 features within electronic health records, including demographic data, vital signs, laboratory tests, and prescriptions for blood pressure, lipid and glucose-lowering medication were supplied to the model. Regression shrinkage and selection via the lasso method was used to identify variables associated with incident MACE. Identified variables were used to generate individual survival probability curves. Incident MACE was defined as the first occurrence of nonfatal myocardial infarction, nonfatal stroke, and CV disease-related death. Results: A total of 12,535 (10.4%) subjects developed MACE between 2008 and 2018. Model performance was evaluated by time-dependent concordance index and Brier score at 1, 2 and 5 years. 
The results of 5-fold cross-validation show that the model displayed good discrimination, achieving time-dependent C-statistics of 0.746±0.005, 0.742±0.003 and 0.738±0.002 at 1, 2 and 5 years respectively. The model demonstrated low Brier scores of 0.0355±0.0004, 0.0601±0.0011 and 0.104±0.004 at 1, 2 and 5 years respectively, indicating good calibration. Factors most predictive of MACE were age and a history of hypertension and hyperlipidemia. Conclusions: We have developed a risk prediction model for MACE in Asian T2D using a large Singaporean T2D cohort, which can be used to support clinical decision-making. The individual survival probability estimates achieve an average C-statistic of 0.742 and are well-calibrated at 1, 2 and 5 years.
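To make the Brier score concrete, the sketch below computes it at a single time horizon as the mean squared difference between the predicted event probability and the observed event status. It assumes every patient's status at the horizon is known, that is, no censoring before the horizon. The function name is hypothetical, and the abstract does not describe how censoring was handled in the study, so this is not a reproduction of its time-dependent Brier score.

```python
def brier_score_at_horizon(event_by_horizon, predicted_event_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    event_by_horizon     : 1 if the event occurred by the horizon, else 0
    predicted_event_prob : predicted probability of an event by the horizon
                           (one minus the predicted survival probability)
    Assumes complete follow-up to the horizon for every patient.
    """
    if len(event_by_horizon) != len(predicted_event_prob) or not event_by_horizon:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for y, p in zip(event_by_horizon, predicted_event_prob)]
    return sum(squared) / len(squared)


# Made-up outcomes and predictions, for illustration only.
observed = [1, 0, 0, 1]
predicted = [0.9, 0.1, 0.2, 0.6]
score = brier_score_at_horizon(observed, predicted)
```

Lower values are better, and a model that predicts 0.5 for everyone scores 0.25 regardless of the outcomes, which gives a simple reference point.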


2020 ◽  
Author(s):  
Jie Wang ◽  
Chao Li ◽  
Jing Li ◽  
Sheng Qin ◽  
Chunlei Liu ◽  
...  

Abstract Background. The prevalence of metabolic syndrome continues to rise sharply worldwide, seriously threatening people's health. In this paper, three kinds of risk prediction models applicable to the metabolic syndrome of oil workers were established, and the optimal model was identified through comparison. The optimal model can be used to identify people at high risk of metabolic syndrome as early as possible, to predict their risk, and to persuade them to change their adverse lifestyle so as to slow down and reduce the incidence of metabolic syndrome. Methods. A total of 1,468 workers from an oil company who participated in occupational health physical examination from April 2017 to October 2018 were included in this study. We established a logistic regression model, a random forest model and a convolutional neural network model, and compared the prediction performance of the three models according to the F1 score, sensitivity, accuracy and other indicators. Results. The results showed that the accuracy of the three models in the training set was 83.45%, 94.21% and 86.34%, the sensitivity was 78.47%, 94.62% and 81.30%, the F1 score was 0.79, 0.93 and 0.83, and the area under the ROC curve was 0.894, 0.987 and 0.935, respectively. In the test set, the accuracy was 76.72%, 80.66% and 78.69%, the sensitivity was 70.00%, 77.50% and 68.33%, the F1 score was 0.70, 0.76 and 0.71, and the area under the ROC curve was 0.797, 0.861 and 0.855, respectively. Conclusions. The study showed that the prediction performance of the random forest model is better than that of the other models, and the model has higher application value: it can better predict the risk of metabolic syndrome in oil workers and provide a corresponding theoretical basis for the health management of oil workers.
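The indicators compared above (accuracy, sensitivity and F1 score) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The function name and the toy counts are made up for illustration and are not taken from the study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, precision and F1 from confusion-matrix counts.

    tp: true positives, fn: false negatives,
    fp: false positives, tn: true negatives.
    """
    total = tp + fn + fp + tn
    if total == 0 or (tp + fn) == 0 or (tp + fp) == 0:
        raise ValueError("counts do not allow all metrics to be computed")
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    if precision + sensitivity == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and sensitivity.
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
metrics = classification_metrics(tp=8, fn=2, fp=2, tn=8)
```

Comparing these values between a training set and a held-out test set, as the abstract does, is what reveals how much of a model's apparent performance survives on unseen data.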

