Assessing Generalisability of Deep Learning Models Trained on Standardised and Non-Standardised Images and their Performance against Tele-dermatologists (Preprint)

2021 ◽  
Author(s):  
Ibukun Oloruntoba ◽  
Tine Vestergaard ◽  
Toan D Nguyen ◽  
Zongyuan Ge ◽  
Victoria Mar

BACKGROUND Convolutional neural networks (CNNs) are a type of artificial intelligence (AI) which show promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image datasets of varying quality and image capture standardisation. OBJECTIVE The objective of our study was to use CNN models with the same architecture, but different training image sets, and test variability in performance when classifying skin cancer images in different populations, acquired with different devices. Additionally, we wanted to assess the performance of the models against Danish tele-dermatologists when tested on images acquired from Denmark. METHODS Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 non-standardised images taken from the International Skin Imaging Collaboration using different image capture devices. CNN-S was trained on 235,268 standardised images, and CNN-S2 was trained on 25,331 standardised images (matched for number and classes of training images to CNN-NS). Both standardised datasets (CNN-S and CNN-S2) were provided by Molemap using the same image capture device. A total of 495 Danish patients with 569 images of skin lesions, predominantly involving Fitzpatrick skin types II and III, were used to test the performance of the models. Four tele-dermatologists independently diagnosed and assessed the images taken of the lesions. Primary outcome measures were sensitivity, specificity and area under the curve of the receiver operating characteristic (AUROC). RESULTS 569 images were taken from 495 patients (280 women [57%], 215 men [43%]; mean age 55 years [SD 17]) for this study. On these images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830–0.889; P<.001) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798–0.861; P=.009), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (95% CI 0.722–0.794; P<.001; P=.009) (Figure 1). 
When the CNNs were matched to the mean sensitivity and specificity of the tele-dermatologists, the models' resultant sensitivities and specificities were surpassed by the tele-dermatologists (Table 1). However, when compared to CNN-S, the differences were not statistically significant (P=.10; P=.053). Performance across all CNN models, as well as the tele-dermatologists, was influenced by image quality. CONCLUSIONS CNNs trained on standardised images had improved performance and therefore greater generalisability in skin cancer classification when applied to an unseen dataset. This is an important consideration for future algorithm development, regulation and approval. Further, when tested on these unseen test images, the tele-dermatologists 'clinically' outperformed all the CNN models; however, the difference was not statistically significant when compared to CNN-S. CLINICALTRIAL This retrospective diagnostic comparative study was approved by the Monash University Human Ethics Committee, Melbourne, Australia (Project ID: 28130).
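The AUROC values reported above have a simple probabilistic reading: the chance that a randomly chosen malignant lesion receives a higher model score than a randomly chosen benign one. A minimal sketch of that pairwise computation, using made-up scores rather than any data from the study:

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive case outscores a negative
    case (ties count half), equivalent to the Mann-Whitney U statistic."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical malignancy scores for 2 malignant and 2 benign lesions
print(auroc([0.9, 0.4], [0.1, 0.7]))  # 3 of 4 pairs ordered correctly -> 0.75
```

This pairwise form makes clear why AUROC is insensitive to any single decision threshold, which is also why the abstract separately matches the models to the tele-dermatologists' sensitivity and specificity.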


Iproceedings ◽  
10.2196/35391 ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. e35391
Author(s):  
Ibukun Oloruntoba ◽  
Toan D Nguyen ◽  
Zongyuan Ge ◽  
Tine Vestergaard ◽  
Victoria Mar

Conflicts of Interest: VM received speaker's fees from Merck, Eli Lilly, Novartis and Bristol Myers Squibb. VM is the principal investigator for a clinical trial funded by the Victorian Department of Health and Human Services with 1:1 contribution from MoleMap.


2021 ◽  
Author(s):  
Suhong Zhao ◽  
Peipei Chen ◽  
Guangrui Shao ◽  
Baijie Li ◽  
Huikun Zhang ◽  
...  

Abstract Objective: To assess the diagnostic ability of abbreviated protocols of MRI (AP-MRI) compared with unenhanced MRI (UE-MRI) in mammographically occult cancers in patients with dense breast tissue. Materials and Methods: The retrospective analysis consisted of 102 patients without positive findings on mammography who received a preoperative full diagnostic protocol (FDP) MRI between January 2015 and December 2018. Two breast radiologists read the UE-MRI, AP-MRI, and FDP images, and the interpretation times were recorded. The sensitivity, specificity, and area under the curve of each MRI protocol, and the sensitivity of these protocols in each subgroup of tumor size, were compared using the chi-square test. The paired-sample t-test was used to evaluate the difference in reading time across the three protocols. Results: Among the 102 women, there were 68 cancers and two benign lesions in 64 patients, and 38 patients had benign or negative findings. Both readers found that the sensitivity and specificity of AP-MRI and UE-MRI were similar (p>0.05), whereas, compared with FDP, UE-MRI had lower sensitivity (Reader 1/Reader 2: p=0.023, p=0.004). For the lesion-size subgroups, one of the readers found that AP-MRI and FDP had higher sensitivities than UE-MRI for detecting lesions ≤10 mm in diameter (p=0.041, p=0.023). Compared with FDP, the average reading time of UE-MRI and AP-MRI was remarkably reduced (p < 0.001). Conclusion: AP-MRI had more advantages than UE-MRI in detecting mammographically occult cancers, especially for breast tumors ≤10 mm in diameter.
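The reading-time comparison above uses a paired-sample t-test, which tests the per-patient differences between two matched measurements. A minimal sketch of the statistic, with entirely hypothetical reading times in seconds (not data from this study):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-sample t statistic for two matched lists of measurements."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean difference / standard error of the differences
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical reading times (seconds) for the same 4 cases under two protocols
fdp_times = [300, 320, 310, 305]  # full diagnostic protocol
ap_times = [120, 130, 125, 128]   # abbreviated protocol

print(round(paired_t(fdp_times, ap_times), 2))  # large positive t: FDP consistently slower
```

Pairing matters here because the same reader interprets the same case under each protocol; an unpaired test would discard that within-case correlation.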


2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Lili Xu ◽  
Gumuyang Zhang ◽  
Bing Shi ◽  
Yanhan Liu ◽  
Tingting Zou ◽  
...  

Abstract Purpose To compare the diagnostic accuracy of biparametric MRI (bpMRI) and multiparametric MRI (mpMRI) for prostate cancer (PCa) and clinically significant prostate cancer (csPCa) and to explore the application value of dynamic contrast-enhanced (DCE) MRI in prostate imaging. Methods and materials This study retrospectively enrolled 235 patients with suspected PCa in our hospital from January 2016 to December 2017, and all lesions were histopathologically confirmed. The lesions were scored according to the Prostate Imaging Reporting and Data System version 2 (PI-RADS V2). The bpMRI (T2-weighted imaging [T2WI], diffusion-weighted imaging [DWI]/apparent diffusion coefficient [ADC]) and mpMRI (T2WI, DWI/ADC and DCE) scores were recorded to plot the receiver operating characteristic (ROC) curves. The area under the curve (AUC), accuracy, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) for each method were calculated and compared. The patients were further stratified according to bpMRI scores (bpMRI ≥3, and bpMRI = 3, 4, 5) to analyse the difference in DCE MRI between PCa and non-PCa lesions (as well as between csPCa and non-csPCa). Results The AUC values for the bpMRI and mpMRI protocols for PCa were comparable (0.790 [0.732–0.840] and 0.791 [0.733–0.841], respectively). The accuracy, sensitivity, specificity, PPV and NPV of bpMRI for PCa were 76.2, 79.5, 72.6, 75.8, and 76.6%, respectively, and the values for mpMRI were 77.4, 84.4, 69.9, 75.2, and 80.6%, respectively. The AUC values for the bpMRI and mpMRI protocols for the diagnosis of csPCa were similar (0.781 [0.722–0.832] and 0.779 [0.721–0.831], respectively). The accuracy, sensitivity, specificity, PPV and NPV of bpMRI for csPCa were 74.0, 83.8, 66.9, 64.8, and 85.0%, respectively; and 73.6, 87.9, 63.2, 63.2, and 87.8%, respectively, for mpMRI. 
For patients with bpMRI scores ≥3, positive DCE results were more common in PCa and csPCa lesions (both P = 0.001). Further stratification analysis showed that for patients with a bpMRI score = 4, PCa and csPCa lesions were more likely to have positive DCE results (P = 0.003 and P < 0.001, respectively). Conclusion The diagnostic accuracy of bpMRI is comparable with that of mpMRI in the detection of PCa and the identification of csPCa. DCE MRI is helpful in further identifying PCa and csPCa lesions in patients with bpMRI scores ≥3, especially bpMRI = 4, which may be conducive to a more accurate PCa risk stratification. Rather than omitting DCE outright, we believe further comprehensive studies of its role in prostate MRI are required.
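The accuracy, sensitivity, specificity, PPV, and NPV figures reported above all derive from the same 2×2 confusion table. A minimal sketch of those definitions, with illustrative counts rather than the study's data:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic test metrics from a 2x2 confusion table."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Illustrative counts only (not taken from the study)
m = diagnostic_metrics(tp=50, fp=10, tn=30, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

Note that PPV and NPV, unlike sensitivity and specificity, shift with disease prevalence in the tested cohort, which is why all five are worth reporting side by side.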


Author(s):  
Renny Amalia Pratiwi ◽  
Siti Nurmaini ◽  
Dian Palupi Rini ◽  
Muhammad Naufal Rachmatullah ◽  
Annisa Darmawahyuni

<span lang="EN-US">One type of skin cancer that is considered a malignant tumor is melanoma. This dangerous disease causes many deaths worldwide, so the early detection of skin lesions is an important task in the diagnosis of skin cancer. Recently, a machine learning paradigm known as deep learning (DL) has been utilized for skin lesion classification. However, in previous studies, seven-class diagnostic classification of skin lesion images based on a single DL approach with a CNN architecture did not produce satisfactory performance. The DL approach allows the development of medical image analysis systems with improved performance, such as the deep convolutional neural network (DCNN) method. In this study, we propose an ensemble learning approach that combines three DCNN architectures, Inception V3, Inception ResNet V2 and DenseNet 201, to improve performance in terms of accuracy, sensitivity, specificity, precision, and F1-score. Seven classes of dermoscopy image categories of skin lesions are utilized, with 10,015 dermoscopy images from the well-known HAM10000 dataset. The proposed model produces good classification performance with 97.23% accuracy, 90.12% sensitivity, 97.73% specificity, 82.01% precision, and 85.01% F1-score. This method gives promising results in classifying skin lesions for cancer diagnosis.</span>
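A common way to combine several trained networks is soft voting: average the per-class probabilities from each model and take the arg-max. The abstract does not specify the fusion rule used, so the averaging below is an assumption, shown with hypothetical softmax outputs:

```python
import numpy as np

def soft_vote(prob_lists):
    """Average per-class probabilities from several models (soft voting),
    then predict the class with the highest mean probability."""
    stacked = np.stack(prob_lists)     # (n_models, n_samples, n_classes)
    mean_probs = stacked.mean(axis=0)  # (n_samples, n_classes)
    return mean_probs.argmax(axis=1)

# Hypothetical softmax outputs from three networks for one lesion, three classes
inception_v3 = np.array([[0.60, 0.30, 0.10]])
inception_resnet_v2 = np.array([[0.50, 0.40, 0.10]])
densenet_201 = np.array([[0.20, 0.70, 0.10]])

pred = soft_vote([inception_v3, inception_resnet_v2, densenet_201])
print(pred)  # the averaged vote favours class 1, though two models lean to class 0
```

The example illustrates why ensembling can beat any single member: a confident dissenting model can correct two weakly wrong ones.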


2021 ◽  
Vol 30 (160) ◽  
pp. 200350
Author(s):  
Elena Schnieders ◽  
Elyesa Ünal ◽  
Volker Winkler ◽  
Peter Dambach ◽  
Valérie R. Louis ◽  
...  

Rationale Guidelines recommend pre-/post-bronchodilator spirometry for diagnosing COPD, but resource constraints limit the availability of spirometry in primary care in low- and middle-income countries. Although spirometry is the diagnostic gold standard, we assessed alternative tools for settings without spirometry. Methods A systematic literature review and meta-analysis was conducted, utilising Cochrane, CINAHL, Google Scholar, PubMed and Web of Science (search cut-off was May 01, 2020). Published studies comparing the accuracy of diagnostic tools for COPD with post-bronchodilator spirometry were considered. Studies without sensitivity/specificity data, without a separate validation sample, or conducted outside of primary care were excluded. Sensitivity, specificity and area under the curve (AUC) were assessed. Results Of 7578 studies, 24 were included (14 635 participants). Hand-held devices yielded a larger AUC than questionnaires. The meta-analysis included 17 studies, and the overall AUC of micro-spirometers (0.84, 95% CI 0.80–0.89) was larger than that of the COPD population screener (COPD-PS) questionnaire (0.77, 95% CI 0.63–0.85) and the COPD diagnostic questionnaire (CDQ) (0.72, 95% CI 0.64–0.78). However, only the difference between micro-spirometers and the CDQ was significant. Conclusions The CDQ and the COPD-PS questionnaire were approximately equally accurate tools. Questionnaires ensured testing of symptomatic patients, but micro-spirometers were more accurate. A combination could increase accuracy but was not evaluated in the meta-analysis.


Symmetry ◽  
2019 ◽  
Vol 11 (6) ◽  
pp. 790 ◽  
Author(s):  
Upender Kalwa ◽  
Christopher Legner ◽  
Taejoon Kong ◽  
Santosh Pandey

Among the different types of skin cancer, melanoma is considered to be the deadliest and is difficult to treat at advanced stages. Detection of melanoma at earlier stages can lead to reduced mortality rates. Desktop-based computer-aided systems have been developed to assist dermatologists with early diagnosis. However, there is significant interest in developing portable, at-home melanoma diagnostic systems which can assess the risk of cancerous skin lesions. Here, we present a smartphone application that combines image capture capabilities with preprocessing and segmentation to extract the Asymmetry, Border irregularity, Color variegation, and Diameter (ABCD) features of a skin lesion. Using the feature sets, classification of malignancy is achieved through support vector machine classifiers. By using adaptive algorithms in the individual data-processing stages, our approach is made computationally light, user friendly, and reliable in discriminating melanoma cases from benign ones. Images of skin lesions are either captured with the smartphone camera or imported from public datasets. The entire process from image capture to classification runs on an Android smartphone equipped with a detachable 10x lens, and processes an image in less than a second. The overall performance metrics are evaluated on a public database of 200 images with Synthetic Minority Over-sampling Technique (SMOTE) (80% sensitivity, 90% specificity, 88% accuracy, and 0.85 area under curve (AUC)) and without SMOTE (55% sensitivity, 95% specificity, 90% accuracy, and 0.75 AUC). The evaluated performance metrics and computation times are comparable or better than previous methods. This all-inclusive smartphone application is designed to be easy-to-download and easy-to-navigate for the end user, which is imperative for the eventual democratization of such medical diagnostic systems.
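The smartphone pipeline above extracts Asymmetry, Border, Color, and Diameter features and classifies them with an SVM. As a simpler, related illustration of how ABCD-style features map to a risk score, the classical dermoscopic ABCD rule combines them into a Total Dermoscopy Score (TDS); the weights and cut-offs below are from the standard Stolz formulation, not from this paper, and in that rule "D" counts dermoscopic structures rather than lesion diameter:

```python
def total_dermoscopy_score(asymmetry, border, color, structures):
    """Classical ABCD rule (Stolz): TDS = 1.3*A + 0.1*B + 0.5*C + 0.5*D.
    A in 0-2, B in 0-8, C in 1-6, D (dermoscopic structures) in 1-5."""
    return 1.3 * asymmetry + 0.1 * border + 0.5 * color + 0.5 * structures

def classify_tds(tds):
    """Standard TDS interpretation bands."""
    if tds < 4.75:
        return "benign"
    if tds <= 5.45:
        return "suspicious"
    return "highly suggestive of melanoma"

score = total_dermoscopy_score(asymmetry=2, border=6, color=4, structures=4)
print(round(score, 2), classify_tds(score))
```

A learned classifier such as the paper's SVM effectively replaces these fixed hand-tuned weights with weights fitted to labelled training images.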


Author(s):  
A. Dascalu ◽  
B. N. Walker ◽  
Y. Oron ◽  
E. O. David

Abstract Purpose Non-melanoma skin cancer (NMSC) is the most frequent keratinocyte-origin skin tumor. It is confirmed that dermoscopy of NMSC confers a diagnostic advantage as compared to visual face-to-face assessment. Under COVID-19 restrictions, diagnosis by telemedicine photos, which are analogous to visual inspection, displaced part of in-person visits. This study evaluated the performance metrics of a dual convolutional neural network (CNN) on dermoscopic images (DI) versus smartphone-captured images (SI) and tested whether artificial intelligence narrows the proclaimed gap in diagnostic accuracy. Methods A CNN that receives a raw image and predicts malignancy and a second independent CNN that processes a sonification (image-to-sound mapping) of the original image were combined into a unified malignancy classifier. All images were histopathology-verified in a comparison between NMSC and benign skin lesions excised as suspected NMSCs. Study outcome criteria were sensitivity and specificity for the unified output. Results Images acquired by DI (n = 132 NMSC, n = 33 benign) were compared to SI (n = 170 NMSC, n = 28 benign). DI and SI analysis resulted in an area under the receiver operator characteristic curve (AUC) of 0.911 and 0.821, respectively. Accuracy was higher for DI (0.88; CI 81.9–92.4) than for SI (0.75; CI 68.1–80.6, p < 0.005). Sensitivity of DI was higher than that of SI (95.3%, CI 90.4–98.3 vs 75.3%, CI 68.1–81.6, p < 0.001), but not specificity (p = NS). Conclusion Telemedicine use of smartphone images might result in a substantial decrease in diagnostic performance as compared to dermoscopy, which needs to be considered by both healthcare providers and patients.


2008 ◽  
Vol 47 (04) ◽  
pp. 163-166 ◽  
Author(s):  
D. Steiner ◽  
S. Laurich ◽  
R. Bauer ◽  
J. Kordelle ◽  
R. Klett

Summary In non-infected knee prostheses, bone scintigraphy is a possible method to diagnose mechanical loosening and, therefore, to inform treatment regimes in symptomatic patients. However, previous studies have shown controversial results for the reliability of bone scintigraphy in diagnosing loosened knee prostheses when using asymptomatic control groups. Therefore, the aim of our study was to optimize the interpretation procedure and to evaluate the accuracy using results from revision surgery as the standard. Methods: Retrospectively, we examined the tibial component in 31 cemented prostheses. In these prostheses, infection was excluded by histological or bacteriological examination during revision surgery. To quantify bone scintigraphy, we used medial and lateral tibial regions with a reference region from the contralateral femur. Results: To differentiate between loosened and intact prostheses, we found a threshold of 5.0 for the maximum tibia-to-femur ratio of the two tibial regions and a threshold of 18% for the difference between the ratios of the two tibial regions. Using these thresholds, values of 0.9, 1, 0.85, 1, and 0.94 were calculated for sensitivity, specificity, negative predictive value, positive predictive value, and accuracy, respectively. To reach a sensitivity of 1, we found a lower threshold of 3.3 for the maximum tibia-to-femur ratio. Conclusion: Quantitative bone scintigraphy appears to be a reliable diagnostic tool for aseptic loosening of knee prostheses, with thresholds evaluated against revision surgery results as the gold standard.
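The two thresholds above define a simple decision rule over the medial and lateral tibia-to-femur uptake ratios. A minimal sketch, assuming the inter-region difference is expressed as a percentage of the smaller ratio (the abstract does not spell out the exact denominator):

```python
def loosening_suspected(medial_ratio, lateral_ratio,
                        max_threshold=5.0, diff_threshold_pct=18.0):
    """Flag possible aseptic tibial loosening from the two tibia-to-femur
    uptake ratios, using the thresholds reported in the study."""
    max_ratio = max(medial_ratio, lateral_ratio)
    # Assumption: percentage difference relative to the smaller ratio
    diff_pct = abs(medial_ratio - lateral_ratio) / min(medial_ratio, lateral_ratio) * 100
    return max_ratio > max_threshold or diff_pct > diff_threshold_pct

print(loosening_suspected(6.2, 4.1))  # uptake ratio above 5.0 -> True
print(loosening_suspected(3.0, 3.1))  # both criteria negative -> False
```

Lowering `max_threshold` to the reported 3.3 trades specificity for a sensitivity of 1, exactly the threshold shift described in the conclusion.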


Author(s):  
Sagar Suman Panda ◽  
Ravi Kumar B.V.V.

Three new analytical methods were optimized and validated for the estimation of tigecycline (TGN) in its injection formulation. A difference UV spectroscopic method, an area under the curve (AUC) method, and an ultrafast liquid chromatographic (UFLC) method were optimized for this purpose. The difference spectrophotometric method relied on the measurement of amplitude when equal-concentration solutions of TGN in HCl were scanned against TGN in NaOH as reference. The measurements were done at 340 nm (maxima) and 410 nm (minima). Further, the AUC under both the maxima and minima was measured at 335–345 nm and 405–415 nm, respectively. The liquid chromatographic method utilized a reversed-phase column (150 mm × 4.6 mm, 5 µm) with a mobile phase of methanol:0.01 M KH2PO4 buffer, pH 3.5 (adjusted using orthophosphoric acid), in the ratio 80:20, v/v. The flow rate was 1.0 ml/min, and diode array detection was done at 349 nm. TGN eluted at 1.656 min. All the methods were validated for linearity, precision, accuracy, stability, and robustness. The developed methods produced validation results within the satisfactory limits of ICH guidance. Further, these methods were applied to estimate the amount of TGN present in commercial lyophilized injection formulations, and the results were compared using a one-way ANOVA test. Overall, the methods are rapid, simple, and reliable for routine quality control of TGN in the bulk and pharmaceutical dosage form. 
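The AUC method integrates the recorded absorbance over each wavelength band (335–345 nm and 405–415 nm). A minimal sketch using the trapezoidal rule, with a made-up flat spectrum rather than real TGN data:

```python
import numpy as np

def band_auc(wavelengths_nm, absorbances, lo_nm, hi_nm):
    """Trapezoidal area under an absorbance spectrum within [lo_nm, hi_nm]."""
    wl = np.asarray(wavelengths_nm, dtype=float)
    ab = np.asarray(absorbances, dtype=float)
    mask = (wl >= lo_nm) & (wl <= hi_nm)
    wl, ab = wl[mask], ab[mask]
    # Trapezoidal rule: segment width times the mean of the endpoint heights
    return float(np.sum(np.diff(wl) * (ab[1:] + ab[:-1]) / 2.0))

# Hypothetical spectrum sampled every 1 nm from 330 to 420 nm
wl = np.arange(330, 421)
ab = np.full(wl.shape, 0.5)  # flat 0.5 AU, for illustration only

print(band_auc(wl, ab, 335, 345))  # 0.5 AU over a 10 nm band -> 5.0
```

Because the band AUC scales linearly with concentration (Beer-Lambert), the integrated area can be calibrated against standards just like a single-wavelength absorbance reading.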

