Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data

BMJ Open ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. e050146
Author(s):  
Jenna M Reps ◽  
Patrick Ryan ◽  
P R Rijnbeek

Objective: The internal validation of prediction models aims to quantify the generalisability of a model. We aim to determine the impact, if any, that the choice of development and internal validation design has on the internal performance bias and model generalisability in big data (n~500 000).

Design: Retrospective cohort.

Setting: Primary and secondary care; three US claims databases.

Participants: 1 200 769 patients pharmaceutically treated for their first occurrence of depression.

Methods: We investigated the impact of the development/validation design across 21 real-world prediction questions. Model discrimination and calibration were assessed. We trained LASSO logistic regression models using US claims data and internally validated the models using eight different designs: ‘no test/validation set’, ‘test/validation set’ and cross-validation with 3, 5 or 10 folds, each with and without a test set. We then externally validated each model in two new US claims databases. We estimated the internal validation bias per design by empirically comparing the estimated internal performance with the external performance.

Results: The differences between the models’ estimated internal performances and external performances were largest for the ‘no test/validation set’ design. This indicates that, even with large data, the ‘no test/validation set’ design causes models to overfit. The seven alternative designs included some validation process to select the hyperparameters and a fair testing process to estimate internal performance. These designs had similar internal performance estimates and performed similarly when externally validated in the two external databases.

Conclusions: Even with big data, it is important to use some validation process to select the optimal hyperparameters and to fairly assess internal validation using a test set or cross-validation.
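One of the designs described above, k-fold cross-validation combined with a held-out test set, can be sketched as an index-splitting routine. This is a minimal illustration and not the authors' implementation: the function names, the 25% test fraction, the seed and the toy row count are all assumptions made for the example, and the model fitting itself is left out.

```python
import random


def split_with_test_and_folds(n_rows, test_fraction=0.25, n_folds=3, seed=0):
    """Shuffle row indices, hold out a test set, and split the rest into folds.

    test_fraction and seed are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_test = int(round(n_rows * test_fraction))
    test_idx = indices[:n_test]   # never touched during hyperparameter selection
    dev_idx = indices[n_test:]    # used for cross-validation

    # Deal the development rows round-robin into n_folds disjoint folds.
    folds = [dev_idx[i::n_folds] for i in range(n_folds)]
    return test_idx, folds


def cv_rounds(folds):
    """Yield (train_idx, validation_idx) pairs, one per fold."""
    for k, validation_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, validation_idx


# Toy usage: 1000 rows, 5-fold cross-validation plus a held-out test set.
test_idx, folds = split_with_test_and_folds(1000, test_fraction=0.25, n_folds=5)
rounds = list(cv_rounds(folds))
for train_idx, validation_idx in rounds:
    # A model would be fit on train_idx and scored on validation_idx here
    # to choose hyperparameters; the test rows stay unseen until the end.
    pass
```

Dropping the hold-out step (setting `test_fraction=0`) gives the "without a test set" variant, where the cross-validation estimate itself serves as the internal performance estimate.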

Symmetry ◽  
2020 ◽  
Vol 12 (3) ◽  
pp. 431 ◽  
Author(s):  
Tomislav Horvat ◽  
Ladislav Havaš ◽  
Dunja Srpak

Interest in sports predictions as well as the public availability of large amounts of structured and unstructured data are increasing every day. As sporting events are not completely independent events, but characterized by the influence of the human factor, the adequate selection of the analysis process is very important. In this paper, seven different classification machine learning algorithms are used and validated with two validation methods: Train&Test and cross-validation. Validation methods were analyzed and critically reviewed. The obtained results are analyzed and compared. Analyzing the results of the used machine learning algorithms, the best average prediction results were obtained by using the nearest neighbors algorithm and the worst prediction results were obtained by using decision trees. The cross-validation method obtained better results than the Train&Test validation method. The prediction results of the Train&Test validation method by using disjoint datasets and up-to-date data were also compared. Better results were obtained by using up-to-date data. In addition, directions for future research are also explained.


Author(s):  
Daiwei Han ◽  
Marjolein Heuvelmans ◽  
Mieneke Rook ◽  
Monique Dorrius ◽  
Luutsen van Houten ◽  
...  

Objectives: To evaluate the performance of a novel convolutional neural network (CNN) for the classification of typical perifissural nodules (PFN).

Methods: Chest CT data from two centers in the UK and The Netherlands (1668 unique nodules, 1260 individuals) were collected. Pulmonary nodules were classified into subtypes, including “typical PFNs” on-site, and were reviewed by a central clinician. The dataset was divided into a training/cross-validation set of 1557 nodules (1103 individuals) and a test set of 196 nodules (158 individuals). For the test set, three radiologically trained readers classified the nodules into three nodule categories: typical PFN, atypical PFN, and non-PFN. The consensus of the three readers was used as reference to evaluate the performance of the PFN-CNN. Typical PFNs were considered as positive results, and atypical PFNs and non-PFNs were grouped as negative results. PFN-CNN performance was evaluated using the ROC curve, confusion matrix, and Cohen’s kappa.

Results: Internal validation yielded a mean AUC of 91.9% (95% CI 90.6–92.9) with 78.7% sensitivity and 90.4% specificity. For the test set, the reader consensus rated 45/196 (23%) of nodules as typical PFN. The classifier-reader agreement (k = 0.62–0.75) was similar to the inter-reader agreement (k = 0.64–0.79). Area under the ROC curve was 95.8% (95% CI 93.3–98.4), with a sensitivity of 95.6% (95% CI 84.9–99.5), and specificity of 88.1% (95% CI 81.8–92.8).

Conclusion: The PFN-CNN showed excellent performance in classifying typical PFNs. Its agreement with radiologically trained readers is within the range of inter-reader agreement. Thus, the CNN-based system has potential in clinical and screening settings to rule out perifissural nodules and increase reader efficiency.

Key Points:
• Agreement between the PFN-CNN and radiologically trained readers is within the range of inter-reader agreement.
• The CNN model for the classification of typical PFNs achieved an AUC of 95.8% (95% CI 93.3–98.4) with 95.6% (95% CI 84.9–99.5) sensitivity and 88.1% (95% CI 81.8–92.8) specificity compared to the consensus of three readers.
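The test-set metrics named in this abstract (sensitivity, specificity and Cohen's kappa) can all be computed from a 2×2 confusion matrix. The sketch below is illustrative only: the abstract reports 45 positives among 196 nodules but not the full matrix, so the four cell counts used here are an assumption, back-calculated to be consistent with the reported 95.6% sensitivity and 88.1% specificity. The resulting kappa is for this assumed binary table and need not match the per-reader values in the paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, Cohen's kappa) for a 2x2 table."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # true positives among reference positives
    specificity = tn / (tn + fp)   # true negatives among reference negatives

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa


# Assumed cell counts (not stated in the abstract): 45 reference positives,
# 151 reference negatives, chosen to reproduce the reported percentages.
sensitivity, specificity, kappa = binary_metrics(tp=43, fn=2, fp=18, tn=133)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} kappa={kappa:.3f}")
```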


2020 ◽  
Author(s):  
Sergey Kucheryavskiy ◽  
Sergei Zhilin ◽  
Oxana Ye. Rodionova ◽  
Alexey L. Pomerantsev

In this paper we propose a new approach for the validation of chemometric models. It is based on the k-fold cross-validation algorithm but, in contrast to conventional cross-validation, our approach makes it possible to create a new dataset that carries the sampling uncertainty estimated by the cross-validation procedure. This dataset, called the pseudo-validation set, can be used similarly to an independent test set, making it possible to compute residual distances, explained variance, scores and other results that cannot be obtained with conventional cross-validation. The paper describes the theoretical details of the proposed approach and its implementation, and presents experimental results obtained using simulated and real chemical datasets.


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, data obtained through online monitoring were processed using data mining. The monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree with the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand and has a very high level of accuracy. A total of 5,476 water quality monitoring records for the Ciliwung River were processed. Classification with the decision tree labeled 1,059 of these records (19.3242%) as Not Polluted and 4,417 (80.6758%) as Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split at 66%. All four test options showed very high accuracy, above 99%. From these results, the Ciliwung River can be predicted to be a polluted river with reference to Government Regulation of the Republic of Indonesia Number 82 of 2001, and using the WEKA application with the Decision Tree algorithm to process monitoring data on three parameters (pH, DO and nitrate) was found to be highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.

