Assessing Generalisability of Deep Learning Models Trained on Standardised and Non-Standardised Images and their Performance against Tele-dermatologists (Preprint)
BACKGROUND Convolutional neural networks (CNNs) are a form of artificial intelligence (AI) that shows promise as a diagnostic aid for skin cancer. However, most are trained on retrospective image datasets of varying quality and image-capture standardisation.

OBJECTIVE The objective of this study was to train CNN models with the same architecture on different training image sets and to test the variability in their performance when classifying skin cancer images from different populations, acquired with different devices. Additionally, we assessed the performance of the models against Danish tele-dermatologists on images acquired in Denmark.

METHODS Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 non-standardised images from the International Skin Imaging Collaboration, taken with a variety of image-capture devices. CNN-S was trained on 235,268 standardised images, and CNN-S2 was trained on 25,331 standardised images (matched to CNN-NS in number and classes of training images). Both standardised training datasets (used for CNN-S and CNN-S2) were provided by Molemap and captured with the same device. The models were tested on 569 images of skin lesions from 495 Danish patients, predominantly of Fitzpatrick skin types II and III. Four tele-dermatologists independently diagnosed and assessed the same images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC).

RESULTS The 569 test images came from 495 patients (280 women [57%], 215 men [43%]; mean age 55 years [SD 17]). On these images, CNN-S achieved an AUROC of 0.861 (CI 0.830–0.889; P=.001) and CNN-S2 achieved an AUROC of 0.831 (CI 0.798–0.861; P=.009), both outperforming CNN-NS, which achieved an AUROC of 0.759 (CI 0.722–0.794; P=.001 and P=.009, respectively) (Figure 1).
When the CNNs were matched to the mean sensitivity and specificity of the tele-dermatologists, the tele-dermatologists surpassed the models' resulting sensitivities and specificities (Table 1). For CNN-S, however, the differences were not statistically significant (P=.10, P=.053). Performance of all CNN models, as well as of the tele-dermatologists, was influenced by image quality.

CONCLUSIONS CNNs trained on standardised images showed better performance, and therefore greater generalisability, in skin cancer classification when applied to an unseen dataset. This is an important consideration for future algorithm development, regulation, and approval. Further, on these unseen test images the tele-dermatologists ‘clinically’ outperformed all CNN models, although the difference was not statistically significant when compared with CNN-S.

CLINICALTRIAL This retrospective diagnostic comparative study was approved by the Monash University Human Ethics Committee, Melbourne, Australia (Project ID: 28130).
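The primary outcome measures above (sensitivity, specificity, AUROC) can be illustrated with a minimal sketch. The scores, labels, and threshold below are entirely hypothetical synthetic data, not the study's data; AUROC is computed via the Mann-Whitney rank interpretation (the probability that a randomly chosen positive case scores above a randomly chosen negative case, ties counting half).

```python
def auroc(scores, labels):
    """AUROC as the probability a random positive (label 1) scores above a
    random negative (label 0); ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity at a fixed decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores for 4 malignant (1) and 4 benign (0) lesions.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auroc(scores, labels))              # -> 0.9375
print(sens_spec(scores, labels, 0.5))     # -> (0.75, 0.75)
```

Matching a model to clinicians, as described above, amounts to choosing the threshold at which the model's sensitivity (or specificity) equals the clinicians' mean, then comparing the remaining metric.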