Patient-Level Cancer Prediction Models From a Nationwide Patient Cohort: Model Development and Validation

10.2196/29807 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e29807
Author(s):  
Eunsaem Lee ◽  
Se Young Jung ◽  
Hyung Ju Hwang ◽  
Jaewoo Jung

Background Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claim data are one of the more useful resources to this end. To avoid unnecessary diagnostic intervention after cancer screening tests, patient-level prediction models should be developed. Objective We aimed to develop cancer prediction models using nationwide claim databases with machine learning algorithms, which are explainable and easily applicable in real-world environments. Methods As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health checkup every 2 years. We gathered all variables from the database including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression methods, light gradient boosting methods, neural networks, survival analysis, and one-class embedding classifier methods to effectively analyze high-dimensional data based on deep learning–based anomaly detection. Performance was measured with area under the curve and area under the precision-recall curve. We validated our models externally with a health checkup database from a tertiary hospital. Results The one-class embedding classifier model achieved the highest area under the curve scores, with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For area under the precision-recall curve, the light gradient boosting models had the highest scores, with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. Conclusions Our results show that it is possible to easily develop applicable cancer prediction models with nationwide claim data using machine learning. The 7 models showed acceptable performances and explainability, and thus can be distributed easily in real-world environments.
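
As an illustration of the two metrics reported in this abstract, the following minimal Python sketch computes the area under the ROC curve and an average-precision estimate of the area under the precision-recall curve. It is not the authors' evaluation code: the function names `roc_auc` and `average_precision`, the toy labels, and the toy scores are assumptions made for this example, and average precision is only one common way to summarize the precision-recall curve.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values observed at the rank of each positive case."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive case")
    return total / hits


# Hypothetical toy data: 1 = cancer diagnosed, 0 = no cancer.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

A classifier that ranks at random has an expected ROC AUC of about 0.5, whereas its expected average precision is roughly the prevalence of the positive class, which is why precision-recall values for rare outcomes are typically much lower than ROC values for the same model.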


Author(s):  
Oguz Akbilgic ◽  
Liam Butler ◽  
Ibrahim Karabayir ◽  
Patricia P Chang ◽  
Dalane W Kitzman ◽  
...  

Abstract Aims Heart failure (HF) is a leading cause of death. Early intervention is the key to reducing HF-related morbidity and mortality. This study assesses the utility of electrocardiograms (ECGs) in HF risk prediction. Methods and results Data from the baseline visits (1987–89) of the Atherosclerosis Risk in Communities (ARIC) study were used. Incident hospitalized HF events were ascertained by ICD codes. Participants with good quality baseline ECGs were included. Participants with prevalent HF were excluded. An ECG-artificial intelligence (AI) model to predict HF was created as a deep residual convolutional neural network (CNN) utilizing standard 12-lead ECG. The area under the receiver operating characteristic curve (AUC) was used to evaluate prediction models including CNN, light gradient boosting machines (LGBM), and Cox proportional hazards regression. A total of 14 613 (45% male, 73% white, mean age ± standard deviation of 54 ± 5 years) participants were eligible. A total of 803 (5.5%) participants developed HF within 10 years from baseline. The convolutional neural network utilizing solely ECG achieved an AUC of 0.756 (0.717–0.795) on the hold-out test data. The ARIC and Framingham Heart Study (FHS) HF risk calculators yielded AUCs of 0.802 (0.750–0.850) and 0.780 (0.740–0.830), respectively. The highest AUC of 0.818 (0.778–0.859) was obtained when ECG-AI model output, age, gender, race, body mass index, smoking status, prevalent coronary heart disease, diabetes mellitus, systolic blood pressure, and heart rate were used as predictors of HF within LGBM. The ECG-AI model output was the most important predictor of HF. Conclusions An ECG-AI model based solely on information extracted from ECG independently predicts HF with accuracy comparable to existing FHS and ARIC risk calculators.


2020 ◽  
Vol 4 (Supplement_1) ◽  
Author(s):  
Akihiro Nomura ◽  
Sho Yamamoto ◽  
Yuta Hayakawa ◽  
Kouki Taniguchi ◽  
Takuya Higashitani ◽  
...  

Abstract Diabetes mellitus (DM) is a chronic disorder, characterized by impaired glucose metabolism. It is linked to increased risks of several diseases such as atrial fibrillation, cancer, and cardiovascular diseases. Therefore, DM prevention is essential. However, the traditional regression-based DM-onset prediction methods are incapable of investigating future DM for generally healthy individuals without DM. Employing gradient-boosting decision trees, we developed a machine learning-based prediction model to identify the DM signatures, prior to the onset of DM. We employed the nationwide annual specific health checkup records, collected during the years 2008 to 2018, from Kanazawa city, Ishikawa, Japan. The data included the physical examinations, blood and urine tests, and participant questionnaires. Individuals without DM (at baseline), who underwent more than two annual health checkups during the said period, were included. The new cases of DM onset were recorded when the participants were diagnosed with DM in the annual check-ups. The dataset was divided into three subsets in a 6:2:2 ratio to constitute the training, tuning (internal validation), and testing datasets. Using the testing dataset, we evaluated the trained prediction model by the area under the curve (AUC), precision, recall, F1 score, and overall accuracy. For every performance metric, a two-sided 95% confidence interval (CI) was estimated using a 1,000-iteration bootstrap method. We included 509,153 annual health checkup records of 139,225 participants. Among them, 65,505 participants without DM were included, which constituted 36,303 participants in the training dataset and 13,101 participants in each of the tuning and testing datasets. We identified a total of 4,696 new DM-onset patients (7.2%) in the study period. Our trained model predicted the future incidence of DM with an AUC, precision, recall, F1 score, and overall accuracy of 0.71 (95% CI, 0.69-0.72), 75.3% (71.6-78.8), 42.2% (39.3-45.2), 54.1% (51.2-56.7), and 94.9% (94.5-95.2), respectively. In conclusion, the machine learning-based prediction model satisfactorily identified the DM onset prior to the actual incidence.
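
The evaluation procedure described in this abstract (a 6:2:2 split followed by bootstrap confidence intervals) can be sketched in a few lines of Python. This is a hedged illustration only: the helper names `split_622`, `precision_recall_f1`, `f1_from`, `accuracy`, and `bootstrap_ci` are hypothetical, and the percentile method used for the interval is an assumption, since the abstract does not say which bootstrap variant was applied.

```python
import random


def split_622(items, seed=0):
    """Shuffle items and split them 6:2:2 into training, tuning, and testing lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_tune = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(labels, preds):
    """Precision, recall, and F1 for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def bootstrap_ci(labels, preds, metric, n_iter=1000, alpha=0.05, seed=0):
    """Two-sided percentile bootstrap interval for metric(labels, preds)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_iter):
        # Resample cases with replacement, keeping label and prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi
```

As a consistency check on the reported figures, a precision of 75.3% and a recall of 42.2% imply an F1 score of 2 × 0.753 × 0.422 / (0.753 + 0.422) ≈ 0.541, which agrees with the reported 54.1%.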


2020 ◽  
Vol 71 (16) ◽  
pp. 2079-2088 ◽  
Author(s):  
Kun Wang ◽  
Peiyuan Zuo ◽  
Yuwei Liu ◽  
Meng Zhang ◽  
Xiaofang Zhao ◽  
...  

Abstract Background This study aimed to develop mortality-prediction models for patients with coronavirus disease-2019 (COVID-19). Methods The training cohort included consecutive COVID-19 patients at the First People’s Hospital of Jiangxia District in Wuhan, China, from 7 January 2020 to 11 February 2020. We selected baseline data through the stepwise Akaike information criterion and ensemble XGBoost (extreme gradient boosting) model to build mortality-prediction models. We then validated these models in randomly collected COVID-19 patients in Union Hospital, Wuhan, from 1 January 2020 to 20 February 2020. Results A total of 296 COVID-19 patients were enrolled in the training cohort; 19 died during hospitalization and 277 were discharged from the hospital. The clinical model developed using age, history of hypertension, and coronary heart disease showed area under the curve (AUC), 0.88 (95% confidence interval [CI], .80–.95); threshold, −2.6551; sensitivity, 92.31%; specificity, 77.44%; and negative predictive value (NPV), 99.34%. The laboratory model developed using age, high-sensitivity C-reactive protein, peripheral capillary oxygen saturation, neutrophil and lymphocyte count, d-dimer, aspartate aminotransferase, and glomerular filtration rate had a significantly stronger discriminatory power than the clinical model (P = .0157), with AUC, 0.98 (95% CI, .92–.99); threshold, −2.998; sensitivity, 100.00%; specificity, 92.82%; and NPV, 100.00%. In the subsequent validation cohort (N = 44), the AUC (95% CI) was 0.83 (.68–.93) and 0.88 (.75–.96) for the clinical model and laboratory model, respectively. Conclusions We developed 2 predictive models for the in-hospital mortality of patients with COVID-19 in Wuhan that were validated in patients from another center.
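
To make the threshold-based metrics in this abstract concrete, the sketch below shows how sensitivity, specificity, and negative predictive value follow from a decision threshold applied to a model score. It is a minimal illustration under stated assumptions: the function names are hypothetical, the scores are invented toy values, and the rule "score at or above the threshold predicts death" is assumed because the abstract does not state the direction of the cutoff.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when score >= threshold is read as a positive prediction."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold  # assumed direction of the cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and negative predictive value from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "npv": tn / (tn + fn) if tn + fn else 0.0,
    }


# Hypothetical scores on a log-odds-like scale; 1 = died, 0 = discharged.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [-1.0, -3.0, -2.0, -4.0, -5.0]
counts = confusion_counts(toy_labels, toy_scores, threshold=-2.6551)
print(screening_metrics(*counts))
```

Lowering the threshold generally raises sensitivity at the cost of specificity, so a single reported cutoff fixes one point on the ROC curve that the AUC summarizes.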


GigaScience ◽  
2020 ◽  
Vol 9 (6) ◽  
Author(s):  
Zhen-Hao Guo ◽  
Zhu-Hong You ◽  
Yan-Bin Wang ◽  
De-Shuang Huang ◽  
Hai-Cheng Yi ◽  
...  

Abstract Background The explosive growth of genomic, chemical, and pathological data provides new opportunities and challenges for humans to thoroughly understand life activities in cells. However, there exist few computational models that aggregate various bioentities to comprehensively reveal the physical and functional landscape of biological systems. Results We constructed a molecular association network, which contains 18 edges (relationships) between 8 nodes (bioentities). Based on this, we propose Bioentity2vec, a new method for representing bioentities, which integrates information about the attributes and behaviors of a bioentity. Applying the random forest classifier, we achieved promising performance on 18 relationships, with an area under the curve of 0.9608 and an area under the precision-recall curve of 0.9572. Conclusions Our study shows that constructing a network with rich topological and biological information is important for systematic understanding of the biological landscape at the molecular level. Our results show that Bioentity2vec can effectively represent biological entities and provides easily distinguishable information about classification tasks. Our method is also able to simultaneously predict relationships between single types and multiple types, which will accelerate progress in biological experimental research and industrial product development.


Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 765 ◽  
Author(s):  
Weizhang Liang ◽  
Suizhi Luo ◽  
Guoyan Zhao ◽  
Hao Wu

Predicting pillar stability is a vital task in hard rock mines as pillar instability can cause large-scale collapse hazards. However, it is challenging because the pillar stability is affected by many factors. With the accumulation of pillar stability cases, machine learning (ML) has shown great potential to predict pillar stability. This study aims to predict hard rock pillar stability using gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) algorithms. First, 236 cases with five indicators were collected from seven hard rock mines. Afterwards, the hyperparameters of each model were tuned using a five-fold cross-validation (CV) approach. Based on the optimal hyperparameter configuration, prediction models were constructed using the training set (70% of the data). Finally, the test set (30% of the data) was adopted to evaluate the performance of each model. The precision, recall, and F1 indexes were utilized to analyze prediction results of each level, and the accuracy and their macro average values were used to assess the overall prediction performance. Based on the sensitivity analysis of indicators, the relative importance of each indicator was obtained. In addition, the safety factor approach and other ML algorithms were adopted as comparisons. The results showed that GBDT, XGBoost, and LightGBM algorithms achieved a better comprehensive performance, and their prediction accuracies were 0.8310, 0.8310, and 0.8169, respectively. The average pillar stress and ratio of pillar width to pillar height had the most important influences on prediction results. The proposed methodology can provide a reliable reference for pillar design and stability risk management.
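
The per-level and macro-averaged metrics mentioned in this abstract can be illustrated with a short Python sketch. The class labels `"a"`, `"b"`, and `"c"` are placeholders rather than the stability levels used in the study, and the function names `per_class_metrics` and `macro_average` are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} using a one-vs-rest view of each class."""
    classes = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Placeholder labels standing in for three stability levels.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(per_class_metrics(y_true, y_pred))
print(macro_average(per_class_metrics(y_true, y_pred)))
```

Macro averaging gives every class the same weight regardless of how many cases it contains, which matters for small datasets where some levels may be rare.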


Author(s):  
Yae Won Park ◽  
Jihwan Eom ◽  
Sooyon Kim ◽  
Hwiyoung Kim ◽  
Sung Soo Ahn ◽  
...  

Abstract Context Early identification of the response of prolactinoma patients to dopamine agonists (DA) is crucial in treatment planning. Objective To develop a radiomics model using an ensemble machine learning classifier with conventional magnetic resonance images (MRIs) to predict the DA response in prolactinoma patients. Design Retrospective study. Setting Severance Hospital. Patients A total of 177 prolactinoma patients who underwent baseline MRI (109 DA responders and 68 DA non-responders) were allocated to the training (n = 141) and test (n = 36) sets. Radiomic features (n = 107) were extracted from coronal T2-weighted MRIs. After feature selection, single models (random forest, light gradient boosting machine, extra-trees, quadratic discriminant analysis, and linear discriminant analysis) with oversampling methods were trained to predict the DA response. A soft voting ensemble classifier was used to achieve the final performance. The performance of the classifier was validated in the test set. Results The ensemble classifier showed an area under the curve (AUC) of 0.81 (95 % confidence interval [CI], 0.74–0.87) in the training set. In the test set, the ensemble classifier showed an AUC, accuracy, sensitivity, and specificity of 0.81 (95 % CI, 0.67–0.96), 77.8 %, 78.6 %, and 77.3 %, respectively. The ensemble classifier achieved the highest performance among all the individual models in the test set. Conclusions Radiomic features may be useful biomarkers to predict the DA response in prolactinoma patients.
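
Soft voting, as named in this abstract, combines models by averaging their predicted probabilities rather than their hard class votes. The sketch below is a minimal illustration of that idea and not the study's implementation: equal model weights and a 0.5 decision threshold are assumptions, and `soft_vote` and `classify` are hypothetical helper names.

```python
def soft_vote(model_probs):
    """Average the positive-class probabilities from several models, case by case.

    model_probs is a list with one inner list per model; every inner list holds
    that model's predicted probability for each case, in the same case order.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_cases = len(model_probs[0])
    if any(len(probs) != n_cases for probs in model_probs):
        raise ValueError("all models must score the same cases")
    n_models = len(model_probs)
    return [sum(probs[i] for probs in model_probs) / n_models for i in range(n_cases)]


def classify(avg_probs, threshold=0.5):
    """Turn averaged probabilities into 0/1 predictions (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in avg_probs]


# Hypothetical outputs of three base models for two patients.
toy_probs = [
    [0.9, 0.2],
    [0.6, 0.4],
    [0.3, 0.3],
]
averaged = soft_vote(toy_probs)
print(classify(averaged))
```

In the first toy case one model disagrees with the other two, yet the averaged probability still crosses the threshold; this is how a confident model can outweigh a hesitant one, which hard majority voting cannot express.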


2018 ◽  
Vol 20 (6) ◽  
pp. 2185-2199 ◽  
Author(s):  
Yanju Zhang ◽  
Ruopeng Xie ◽  
Jiawei Wang ◽  
André Leier ◽  
Tatiana T Marquez-Lago ◽  
...  

Abstract As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.


2021 ◽  
Vol 13 (11) ◽  
pp. 2096
Author(s):  
Zhongqi Yu ◽  
Yuanhao Qu ◽  
Yunxin Wang ◽  
Jinghui Ma ◽  
Yu Cao

A visibility forecast model called a boosting-based fusion model (BFM) was established in this study. The model uses a fusion machine learning model based on multisource data, including air pollutants, meteorological observations, moderate resolution imaging spectroradiometer (MODIS) aerosol optical depth (AOD) data, and an operational regional atmospheric environmental modeling System for eastern China (RAEMS) outputs. Extreme gradient boosting (XGBoost), a light gradient boosting machine (LightGBM), and a numerical prediction method, i.e., RAEMS, were fused to establish this prediction model. Three sets of prediction models, that is, BFM, LightGBM based on multisource data (LGBM), and RAEMS, were used to conduct visibility prediction tasks. The training set was from 1 January 2015 to 31 December 2018 and used several data pre-processing methods, including synthetic minority over-sampling technique (SMOTE) data resampling, a loss function adjustment, and 10-fold cross-validation. Moreover, apart from the basic features (variables), more spatial and temporal gradient features were considered. The testing set was from 1 January to 31 December 2019 and was adopted to validate the feasibility of the BFM, LGBM, and RAEMS. Statistical indicators confirmed that the machine learning methods improved the RAEMS forecast significantly and consistently. The root mean square error and correlation coefficient of BFM for the next 24/48 h were 5.01/5.47 km and 0.80/0.77, respectively, which were much higher than those of RAEMS. The statistics and binary score analysis for different areas in Shanghai also proved the reliability and accuracy of using BFM, particularly in low-visibility forecasting. Overall, BFM is a suitable tool for predicting the visibility. It provides a more accurate visibility forecast for the next 24 and 48 h in Shanghai than LGBM and RAEMS. The results of this study provide support for real-time operational visibility forecasts.
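
The two statistical indicators quoted in this abstract, root mean square error and the correlation coefficient, can be computed as in the following Python sketch. The function names `rmse` and `pearson_r` and the toy visibility values are assumptions for illustration, and the use of the Pearson coefficient is itself an assumption because the abstract does not name the correlation measure.

```python
import math


def rmse(observed, forecast):
    """Root mean square error, in the same unit as the inputs (e.g., km)."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - f) ** 2 for o, f in zip(observed, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed and forecast visibility in km.
toy_observed = [10.0, 4.0, 7.5, 12.0]
toy_forecast = [9.0, 5.5, 7.0, 11.0]
print(rmse(toy_observed, toy_forecast))
print(pearson_r(toy_observed, toy_forecast))
```

The two indicators point in opposite directions: a lower root mean square error indicates a better forecast, whereas a higher correlation coefficient does.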

