Predictive model for the 5-year survival status of osteosarcoma patients based on the SEER database and XGBoost algorithm

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jiuzhou Jiang ◽  
Hao Pan ◽  
Mobai Li ◽  
Bao Qian ◽  
Xianfeng Lin ◽  
...  

Abstract: Osteosarcoma is the most common bone malignancy, with the highest incidence in children and adolescents. Survival rate prediction is important for improving prognosis and planning therapy. However, there is still no highly accurate prediction model for osteosarcoma. Therefore, we aimed to construct an artificial intelligence (AI) model for predicting the 5-year survival of osteosarcoma patients by using extreme gradient boosting (XGBoost), a large-scale machine-learning algorithm. We identified cases of osteosarcoma in the Surveillance, Epidemiology, and End Results (SEER) Research Database and excluded substandard samples. The study population comprised 835 patients, divided into a training set (n = 668) and a validation set (n = 167). Characteristics selected via survival analyses were used to construct the model. Receiver operating characteristic (ROC) curve and decision curve analyses were performed to evaluate the predictions. The accuracy of the prediction model was excellent in both the training set (area under the ROC curve [AUC] = 0.977) and the validation set (AUC = 0.911). Decision curve analyses showed that the model could be used to support clinical decisions. XGBoost is an effective algorithm for predicting the 5-year survival of osteosarcoma patients. Our prediction model had excellent accuracy and is therefore useful in clinical settings.
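
The AUC values quoted in the abstract above summarize how well predicted scores rank positive cases above negative ones. The following is a minimal sketch of that rank-based definition, not the authors' evaluation code; the labels and scores are hypothetical toy data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = event, 0 = no event.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; production code would normally use a sorted-rank formulation or a library routine.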

2021 ◽  
Vol 11 ◽  
Author(s):  
Yinghao Meng ◽  
Hao Zhang ◽  
Qi Li ◽  
Fang Liu ◽  
Xu Fang ◽  
...  

Purpose: To develop and validate a machine learning classifier based on multidetector computed tomography (MDCT) for the preoperative prediction of tumor–stroma ratio (TSR) expression in patients with pancreatic ductal adenocarcinoma (PDAC). Materials and Methods: In this retrospective study, 227 patients with PDAC underwent an MDCT scan and surgical resection. We quantified the TSR by using hematoxylin and eosin staining and extracted 1409 arterial and portal venous phase radiomics features for each patient, respectively. Moreover, we used the least absolute shrinkage and selection operator logistic regression algorithm to reduce the features. The extreme gradient boosting (XGBoost) classifier was developed using a training set consisting of 167 consecutive patients admitted between December 2016 and December 2017. The model was validated in 60 consecutive patients admitted between January 2018 and April 2018. We determined the XGBoost classifier performance based on its discriminative ability, calibration, and clinical utility. Results: We observed low and high TSR in 91 (40.09%) and 136 (59.91%) patients, respectively. A log-rank test revealed significantly longer survival for patients in the TSR-low group than for those in the TSR-high group. The prediction model showed good discrimination in the training set (area under the curve [AUC] = 0.93) and moderate discrimination in the validation set (AUC = 0.63). While the sensitivity, specificity, accuracy, positive predictive value, and negative predictive value for the training set were 94.06%, 81.82%, 0.89, 0.89, and 0.90, respectively, those for the validation set were 85.71%, 48.00%, 0.70, 0.70, and 0.71, respectively. Conclusions: The CT radiomics-based XGBoost classifier provides a potentially valuable noninvasive tool to predict TSR in patients with PDAC and optimize risk stratification.
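
The five validation-set figures reported above all derive from one 2×2 confusion matrix. The sketch below shows the arithmetic. The counts (30 true positives, 13 false positives, 12 true negatives, 5 false negatives) are an assumption: they are back-calculated from the reported percentages for the 60 validation patients and are not stated in the abstract, which also does not say which TSR group was treated as the positive class.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from the four cells of a confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Assumed counts, inferred from the reported validation-set percentages.
validation = diagnostic_metrics(tp=30, fp=13, tn=12, fn=5)
for name, value in validation.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the results round to 85.71%, 48.00%, 0.70, 0.70, and 0.71, which matches the reported validation-set values.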


2021 ◽  
Vol 11 (17) ◽  
pp. 7793
Author(s):  
Alessandro Massaro ◽  
Antonio Panarese ◽  
Daniele Giannone ◽  
Angelo Galiano

The organized large-scale retail sector has been gradually establishing itself around the world, and its activities increased exponentially during the pandemic period. This modern sales system uses data mining technologies to process valuable information and increase profit. In this direction, the extreme gradient boosting (XGBoost) algorithm was applied in an industrial project as a supervised learning algorithm to predict product sales, taking into account promotion conditions and a multiparametric analysis. The implemented XGBoost model was trained and tested using the Augmented Data (AD) technique, which is intended for cases where the available data are not sufficient to achieve the desired accuracy, as in many practical artificial intelligence applications where a large dataset is not available. The prediction was applied to a grid of segmented customers, allowing personalized services according to their purchasing behavior. The AD technique yielded good accuracy compared with results obtained from the initial dataset, which had few records. The prediction errors, namely the Root Mean Square Error (RMSE) and Mean Square Error (MSE), decreased by about an order of magnitude. The AD technique formulated for the large-scale retail sector also represents a good way to calibrate the training model.
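
For reference, the two error measures named in the abstract above can be computed as follows. This is a minimal sketch with hypothetical sales figures; it does not reproduce the authors' data or pipeline.

```python
import math


def mse(actual, predicted):
    """Mean square error: average of squared differences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: square root of the MSE, in the units of the target."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical weekly sales (units sold) and model predictions.
actual = [10.0, 12.0, 8.0, 15.0]
predicted = [11.0, 10.0, 8.0, 18.0]
print(f"MSE = {mse(actual, predicted):.3f}, RMSE = {rmse(actual, predicted):.3f}")
```

Because RMSE is the square root of MSE, a tenfold reduction in MSE corresponds to roughly a 3.2-fold reduction in RMSE, so the two measures do not shrink by the same factor.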


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e16801-e16801
Author(s):  
Daniel R Cherry ◽  
Qinyu Chen ◽  
James Don Murphy

e16801 Background: Pancreatic cancer has an insidious presentation with four-in-five patients presenting with disease not amenable to potentially curative surgery. Efforts to screen patients for pancreatic cancer using population-wide strategies have proven ineffective. We applied a machine learning approach to create an early prediction model drawing on the content of patients’ electronic health records (EHRs). Methods: We used patient data from OptumLabs which included de-identified data extracted from patient EHRs collected between 2009 and 2017. We identified patients diagnosed with pancreatic cancer at age 40 or later, which we categorized into early-stage pancreatic cancer (ESPC; n = 3,322) and late-stage pancreatic cancer (LSPC; n = 25,908) groups. ESPC cases were matched to non-pancreatic cancer controls in a ratio of 1:16 based on diagnosis year and geographic division, and the cohort was divided into training (70%) and test (30%) sets. The prediction model was built using an eXtreme Gradient Boosting machine learning algorithm of ESPC patients’ EHRs in the year preceding diagnosis, with features including patient demographics, procedure and clinical diagnosis codes, clinical notes and medications. Model discrimination was assessed with sensitivity, specificity, positive predictive value (PPV) and area under the curve (AUC) with a score of 1.0 indicating perfect prediction. Results: The final AUC in the test set was 0.841, and the model included 583 features, of which 248 (42.5%) were physician note elements, 146 (25.0%) were procedure codes, 91 (15.6%) were diagnosis codes, 89 (15.3%) were medications and 9 (1.54%) were demographic features. The most important features were history of pancreatic disorders (not diabetes or cancer), age, income, biliary tract disease, education level, obstructive jaundice and abdominal pain. We evaluated model performance at varying classification thresholds. 
When applied to patients over 40, choosing a threshold with a sensitivity of 20% produced a specificity of 99.9% and a PPV of 2.5%. The model PPV increased with age; for patients over 80, PPV was 8.0%. LSPC patients identified by the model would have been detected a median of 4 months before their actual diagnosis, with a quarter of these patients identified at least 14 months earlier. Conclusions: Using EHR data to identify early-stage pancreatic cancer patients shows promise. While widespread use of this approach on an unselected population would produce high rates of false positives, this technique could be employed among high-risk patients or paired with other screening tools.
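
The low PPV despite 99.9% specificity follows from Bayes' rule when the condition is rare. The sketch below uses the sensitivity and specificity quoted above; the prevalence values are hypothetical and chosen only to show how strongly PPV depends on them.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as reported; prevalences are hypothetical.
for prevalence in (0.0001, 0.001, 0.01):
    value = ppv(sensitivity=0.20, specificity=0.999, prevalence=prevalence)
    print(f"prevalence {prevalence:.4f} -> PPV {value:.3f}")
```

This is consistent with the authors' suggestion to apply the model to high-risk groups, where the higher prevalence raises the PPV.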


2021 ◽  
Author(s):  
Jiacheng Shi ◽  
Xiaohuan Chen ◽  
Qiong Yang ◽  
Cai-Mei Wang ◽  
Qian Huang ◽  
...  

Abstract: Currently, the most widely used screening methods for hyperuricemia (HUA) involve invasive laboratory tests, which are lacking in many rural hospitals in China. This study explores the use of non-invasive physical examinations to construct a simple prediction model for HUA. Data of 9,252 adults from July to October 2019 in the Affiliated Hospital of Guilin Medical College were collected and divided randomly into a training set (n = 6,364) and a validation set (n = 2,888) at a ratio of 7:3. In the training set, non-invasive physical examination indicators of age, gender, body mass index (BMI) and prevalence of hypertension were included for logistic regression analysis, and a nomogram model was established. The classification and regression tree (CART) algorithm of the decision tree model was used to build a classification tree model. Receiver operating characteristic (ROC) curve, calibration curve and decision curve analyses (DCA) were used to test the distinction, accuracy and clinical applicability of the two models. The results showed that age, gender, BMI and prevalence of hypertension were all related to the occurrence of HUA. The area under the ROC curve (AUC) of the nomogram model was 0.806 and 0.791 in the training set and validation set, respectively. The AUC of the classification tree model was 0.802 and 0.794 in the two sets, respectively; the AUCs of the two models were not statistically different. The calibration curves and DCAs of the two models performed well on accuracy and clinical practicality, which suggested that these models may be suitable for predicting HUA in rural settings.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Jia-Cheng Shi ◽  
Xiao-Huan Chen ◽  
Qiong Yang ◽  
Cai-Mei Wang ◽  
Qian Huang ◽  
...  

Abstract: Currently, the most widely used screening methods for hyperuricemia (HUA) involve invasive laboratory tests, which are lacking in many rural hospitals in China. This study explored the use of non-invasive physical examinations to construct a simple prediction model for HUA, in order to reduce the economic burden and invasive operations such as blood sampling, and to provide some help for the health management of people in poor areas with limited medical resources. Data of 9252 adults from April to June 2017 in the Affiliated Hospital of Guilin Medical College were collected and divided randomly into a training set (n = 6364) and a validation set (n = 2888) at a ratio of 7:3. In the training set, non-invasive physical examination indicators of age, gender, body mass index (BMI) and prevalence of hypertension were included for logistic regression analysis, and a nomogram model was established. The classification and regression tree (CART) algorithm of the decision tree model was used to build a classification tree model. Receiver operating characteristic (ROC) curve, calibration curve and decision curve analyses (DCA) were used to test the distinction, accuracy and clinical applicability of the two models. The results showed that age, gender, BMI and prevalence of hypertension were all related to the occurrence of HUA. The area under the ROC curve (AUC) of the nomogram model was 0.806 and 0.791 in the training set and validation set, respectively. The AUC of the classification tree model was 0.802 and 0.794 in the two sets, respectively; the AUCs of the two models were not statistically different. The calibration curves and DCAs of the two models performed well on accuracy and clinical practicality, which suggested that these models may be suitable for predicting HUA in rural settings.
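
A CART classification tree is grown by repeatedly choosing the split that leaves the purest child nodes. The sketch below illustrates one such split search on a single numeric feature using Gini impurity, which is the usual CART criterion for classification; the abstract does not state which criterion was used, and the BMI values and labels here are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Threshold among the observed values that minimizes weighted impurity."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    if not candidates:
        raise ValueError("feature has a single distinct value; nothing to split")
    return min(candidates, key=lambda t: split_gini(values, labels, t))


# Hypothetical BMI values and HUA labels (1 = hyperuricemia).
bmi = [20, 22, 24, 27, 29, 31]
hua = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(bmi, hua)
print(f"split at BMI <= {threshold}, impurity {split_gini(bmi, hua, threshold):.3f}")
```

A full tree applies this search to every feature at every node and stops according to depth or pruning rules, which are omitted here.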


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Li Liu ◽  
Zhiyong Chen ◽  
Yingrong Du ◽  
Jianpeng Gao ◽  
Junyi Li ◽  
...  

Abstract: This study evaluated the predictive effect of T-lymphoid subsets on the conversion of common COVID-19 to severe COVID-19. Laboratory data were collected retrospectively from patients with common COVID-19 in the First People's Hospital of Zaoyang, Hubei Province, China, and the Third People's Hospital of Kunming, Yunnan Province, China, between January 20, 2020 and March 15, 2020, and were divided into a training set and a validation set. Univariate and multivariate logistic regression analyses were performed in the training set to investigate the risk factors for the conversion of common COVID-19 to severe; the prediction model was then established and verified externally in the validation set. Of 408 patients with common COVID-19, 60 (14.71%) became severe 6–10 days after diagnosis. Univariate and multivariate logistic regression analysis revealed that lactate (P = 0.042, OR = 1097.983, 95% CI 1.303–924,798.262) and CD8+ T cells (P = 0.010, OR = 0.903, 95% CI 0.835–0.975) were independent risk factors for progression from the common type to the severe type. The area under the ROC curve of lactate and CD8+ T cells was 0.754 (0.581–0.928) and 0.842 (0.713–0.970), respectively. The actual observed values were highly consistent with the values predicted by the model in curve fitting. The established prediction model was verified in 78 COVID-19 patients in the validation set; the area under the ROC curve was 0.906 (0.861–0.981), and the calibration curve was consistent. CD8+ T cells, as an independent risk factor, could predict the transition from common COVID-19 to severe.
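
Odds ratios from logistic regression are exponentiated coefficients, and their confidence intervals are usually symmetric on the log scale. The sketch below assumes the reported intervals are standard Wald intervals (the abstract does not say so) and recovers the implied standard error of each log odds ratio from the reported bounds.

```python
import math


def se_from_interval(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a 95% Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def wald_interval(odds_ratio, se_log_or, z=1.96):
    """95% Wald interval for an odds ratio, built on the log scale."""
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)


# Reported values: CD8+ T cells OR 0.903 (0.835 to 0.975),
# lactate OR 1097.983 (1.303 to 924,798.262).
se_cd8 = se_from_interval(0.835, 0.975)
se_lactate = se_from_interval(1.303, 924798.262)
lo, hi = wald_interval(0.903, se_cd8)
print(f"CD8+: SE(log OR) = {se_cd8:.4f}, rebuilt CI = ({lo:.3f}, {hi:.3f})")
print(f"lactate: SE(log OR) = {se_lactate:.2f}")
```

Under this assumption the lactate coefficient has a standard error above 3 on the log-odds scale, which is why its interval spans nearly six orders of magnitude, whereas the CD8+ estimate is comparatively precise.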


Stroke ◽  
2013 ◽  
Vol 44 (suppl_1) ◽  
Author(s):  
Kohkichi Hosoda ◽  
Nobuyuki Akutsu ◽  
Atsushi Fujita ◽  
Eiji Kohmura

[Objective] Recently, we reported a preliminary prediction model with carotid plaque MRI to estimate risk for new ischemic brain lesions after CEA or CAS. The objective of this study was to validate this model in a new set of patients with carotid stenosis. [Methods] One hundred four patients with carotid stenosis undergoing treatment (63 CEA, 41 CAS) were used as a training set for construction of a preliminary prediction model to estimate risk for new ischemic brain lesions after CEA or CAS. T1 and T2 signal intensity of carotid plaque were measured on black-blood MRI. Associations among MRI findings, treatment, clinical factors, and occurrence of new ischemic lesions on DWI 1 day after treatment were studied by logistic regression. The validity of the prediction model was examined using a new set of patients with carotid stenosis (n = 43) as a validation set. [Results] In the training set, new DWI lesions after treatment were observed in 25 patients (24%). The model demonstrated that T1 signal intensity and CAS were positively associated with new lesions on post-treatment DWI scans, and T2 signal intensity was negatively associated (Fig. 1). The C-index was 0.79, which indicated some predictive value. In the validation set, new DWI lesions after treatment were observed in 10 patients (23%). However, the C-index was 0.6 and the positive predictive value was 33% (Fig. 2), which suggested overfitting of our model and/or differences in case-mix between the training set and validation set. [Conclusions] Our preliminary prediction model may provide some useful information for decision-making regarding treatment strategy, but data from more patients are needed to improve its predictive value.


2020 ◽  
Vol 34 (04) ◽  
pp. 6853-6860
Author(s):  
Xuchao Zhang ◽  
Xian Wu ◽  
Fanglan Chen ◽  
Liang Zhao ◽  
Chang-Tien Lu

The success of training accurate models strongly depends on the availability of a sufficient collection of precisely labeled data. However, real-world datasets contain erroneously labeled data samples that substantially hinder the performance of machine learning models. Meanwhile, well-labeled data is usually expensive to obtain and only a limited amount is available for training. In this paper, we consider the problem of training a robust model by using large-scale noisy data in conjunction with a small set of clean data. To leverage the information contained in the clean labels, we propose a novel self-paced robust learning algorithm (SPRL) that trains the model in a process from more reliable (clean) data instances to less reliable (noisy) ones under the supervision of well-labeled data. The self-paced learning process hedges the risk of selecting corrupted data into the training set. Moreover, theoretical analyses on the convergence of the proposed algorithm are provided under mild assumptions. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed approach can achieve a considerable improvement in effectiveness and robustness over existing methods.
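
The idea of moving from reliable to less reliable instances can be illustrated with the classic hard-threshold form of self-paced learning, in which a sample is used only if its current loss falls below a threshold that grows each round. This is a generic sketch and not the SPRL algorithm itself, whose details the abstract does not give; always including the clean samples is an assumption made here, and the losses are hypothetical.

```python
def self_paced_weights(losses, is_clean, threshold):
    """Binary selection weights: clean samples are always used, while noisy
    samples are used only when their current loss is below the threshold."""
    return [
        1 if clean or loss < threshold else 0
        for loss, clean in zip(losses, is_clean)
    ]


# Hypothetical per-sample losses; only the first sample has a trusted label.
losses = [0.1, 0.5, 2.0, 0.3]
is_clean = [True, False, False, False]

# Raising the threshold each round admits progressively harder samples.
schedule = [0.4, 1.0, 3.0]
selections = [self_paced_weights(losses, is_clean, t) for t in schedule]
for t, weights in zip(schedule, selections):
    print(f"threshold {t}: weights {weights}")
```

In a real training loop the losses would be recomputed after each model update, so the set of admitted samples changes as the model improves.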


2019 ◽  
Vol 11 (12) ◽  
pp. 1505 ◽  
Author(s):  
Heng Zhang ◽  
Anwar Eziz ◽  
Jian Xiao ◽  
Shengli Tao ◽  
Shaopeng Wang ◽  
...  

Accurate mapping of vegetation is a premise for conserving, managing, and sustainably using vegetation resources, especially in conditions of intensive human activities and accelerating global changes. However, it is still challenging to produce high-resolution multiclass vegetation maps with high accuracy, due to the inability of traditional mapping techniques to distinguish mosaic vegetation classes with subtle differences and the paucity of fieldwork data. This study created a workflow by adopting a promising classifier, extreme gradient boosting (XGBoost), to produce accurate vegetation maps of two strikingly different cases (the Dzungarian Basin in China and New Zealand) based on extensive features and abundant vegetation data. For the Dzungarian Basin, a vegetation map with seven vegetation types, 17 subtypes, and 43 associations was produced with an overall accuracy of 0.907, 0.801, and 0.748, respectively. For New Zealand, a map of 10 habitats and a map of 41 vegetation classes were produced with overall accuracies of 0.946 and 0.703, respectively. The workflow incorporating simplified field survey procedures outperformed conventional field survey and remote sensing based methods in terms of accuracy and efficiency. In addition, it opens the possibility of building large-scale, high-resolution, and timely vegetation monitoring platforms for most terrestrial ecosystems worldwide with the aid of Google Earth Engine and citizen science programs.


2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Ke Li ◽  
Sijia Zhang ◽  
Di Yan ◽  
Yannan Bin ◽  
Junfeng Xia

Background: Identification of hot spots in protein-DNA interfaces provides crucial information for research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive and expensive, there is a need to develop a reliable computational method to predict hot spots on a large scale. Results: Here, we proposed a new method named sxPDH based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost) to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of the protein sequence, structure, network and solvent accessible information, and systematically assessed various feature selection methods and feature dimensionality reduction methods based on manifold learning. The results show that the S-ISOMAP method is superior to other feature selection or manifold learning methods. XGBoost was then used to develop the hot spot prediction model sxPDH based on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion: Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on a benchmark dataset indicate that sxPDH can achieve generally better performance in predicting hot spots compared to the state-of-the-art methods.
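
The F1 score reported above is the harmonic mean of precision and recall. A minimal sketch of the calculation follows; the counts are hypothetical and are not taken from the sxPDH benchmark.

```python
def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts for predicted hot-spot residues.
f1 = f1_score(tp=8, fp=2, fn=4)
print(f"F1 = {f1:.3f}")
```

Unlike AUC, F1 depends on the chosen classification threshold, which is one reason the two figures are typically reported together.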

