Assessing Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists

Iproceedings ◽  
10.2196/35391 ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. e35391
Author(s):  
Ibukun Oloruntoba ◽  
Toan D Nguyen ◽  
Zongyuan Ge ◽  
Tine Vestergaard ◽  
Victoria Mar

Background Convolutional neural networks (CNNs) are a type of artificial intelligence that show promise as a diagnostic aid for skin cancer. However, the majority are trained on retrospective image data sets of varying quality and image capture standardization. Objective The aim of our study was to use CNN models with the same architecture, but different training image sets, and test the variability in their performance when classifying skin cancer images from different populations, acquired with different devices. Additionally, we wanted to assess the performance of the models against Danish teledermatologists when tested on images acquired from Denmark. Methods Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 nonstandardized images taken from the International Skin Imaging Collaboration, acquired with different image capture devices. CNN-S was trained on 235,268 standardized images, and CNN-S2 was trained on 25,331 standardized images (matched for number and classes of training images to CNN-NS). Both standardized training sets were provided by MoleMap and acquired with the same image capture device. A total of 495 Danish patients with 569 images of skin lesions, predominantly of Fitzpatrick skin types II and III, were used to test the performance of the models. Four teledermatologists independently diagnosed and assessed the images taken of the lesions. Primary outcome measures were sensitivity, specificity, and area under the curve of the receiver operating characteristic (AUROC). Results A total of 569 images were taken from 495 patients (n=280, 57% women; n=215, 43% men; mean age 55 years, SD 17) for this study. On these images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889; P<.001 vs CNN-NS), and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; P=.009 vs CNN-NS), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists, the models' resultant sensitivities and specificities were surpassed by the teledermatologists; however, compared with CNN-S, the differences were not statistically significant (P=.10; P=.05). Performance across all CNN models and teledermatologists was influenced by image quality. Conclusions CNNs trained on standardized images showed improved performance, and therefore greater generalizability, in skin cancer classification when applied to an unseen data set. This is an important consideration for future algorithm development, regulation, and approval. Further, when tested on these unseen images, the teledermatologists clinically outperformed all the CNN models; however, the difference compared with CNN-S was not statistically significant. Conflicts of Interest VM received speaker fees from Merck, Eli Lilly, Novartis, and Bristol Myers Squibb. VM is the principal investigator for a clinical trial funded by the Victorian Department of Health and Human Services with a 1:1 contribution from MoleMap.
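
A minimal sketch (not the authors' code) of the two evaluation steps this abstract describes, assuming scikit-learn: computing an AUROC for a binary malignant-versus-benign classifier, and matching the model's operating threshold to a target sensitivity (e.g., the teledermatologists' mean) to read off the corresponding specificity. The labels, scores, and target value are synthetic stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=569)                      # 1 = malignant, 0 = benign
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 569), 0, 1)

auroc = roc_auc_score(y_true, y_score)                     # area under the ROC curve

# Match the model's operating point to a target sensitivity, then
# read off the specificity it achieves at that threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
target_sensitivity = 0.85                                  # assumed value, for illustration
idx = np.argmin(np.abs(tpr - target_sensitivity))
print(f"AUROC={auroc:.3f}, sens={tpr[idx]:.2f}, spec={1 - fpr[idx]:.2f}")
```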

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Gurman Gill ◽  
Reinhard R. Beichel

Dynamic and longitudinal lung CT imaging produce 4D lung image data sets, enabling applications like radiation treatment planning or assessment of response to treatment of lung diseases. In this paper, we present a 4D lung segmentation method that mutually utilizes all individual CT volumes to derive segmentations for each CT data set. Our approach is based on a 3D robust active shape model and extends it to fully utilize 4D lung image data sets. This yields an initial segmentation for the 4D volume, which is then refined by using a 4D optimal surface finding algorithm. The approach was evaluated on a diverse set of 152 CT scans of normal and diseased lungs, consisting of total lung capacity and functional residual capacity scan pairs. In addition, a comparison to a 3D segmentation method and a registration-based 4D lung segmentation approach was performed. The proposed 4D method obtained an average Dice coefficient of 0.9773 ± 0.0254, which was statistically significantly better (p value ≪ 0.001) than the 3D method (0.9659 ± 0.0517). Compared to the registration-based 4D method, our method obtained better or similar performance but was 58.6% faster. Also, the method can easily be expanded to process 4D CT data sets consisting of several volumes.
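
The headline metric above is the Dice coefficient; a minimal sketch of how it is computed for binary segmentation masks (toy volumes, not the paper's lung data):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Example: two toy overlapping 3D volumes.
a = np.zeros((4, 4, 4), dtype=bool); a[1:3, 1:3, 1:3] = True
b = np.zeros((4, 4, 4), dtype=bool); b[1:4, 1:3, 1:3] = True
print(dice_coefficient(a, b))  # 2*8 / (8 + 12) = 0.8
```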


Separations ◽  
2018 ◽  
Vol 5 (3) ◽  
pp. 44 ◽  
Author(s):  
Alyssa Allen ◽  
Mary Williams ◽  
Nicholas Thurn ◽  
Michael Sigman

Computational models for determining the strength of fire debris evidence based on likelihood ratios (LR) were developed using in silico-generated data and validated against data sets derived from different distributions of ASTM E1618-14-designated ignitable liquid classes and substrate pyrolysis contributions. All of the models perform well in cross-validation against the distribution used to generate them. However, a model trained on data that lacks representatives of some ASTM E1618-14 classes performs poorly when validated against data sets that contain the missing classes. A quadratic discriminant model based on a balanced data set (ignitable liquid versus substrate pyrolysis), with a uniform distribution of the ASTM E1618-14 classes, performed well (receiver operating characteristic area under the curve of 0.836) when tested against laboratory-developed, casework-relevant samples of known ground truth.
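
A hedged sketch of the kind of model named above: a quadratic discriminant classifier over fire-debris features, with a likelihood ratio formed from the posterior probabilities under equal priors. The feature matrix and labels are synthetic; this is not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)
# Synthetic features: 1 = ignitable liquid, 0 = substrate pyrolysis.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(1, 1, (200, 5))])
y = np.repeat([0, 1], 200)

qda = QuadraticDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

# With equal priors, the ratio of posteriors equals the likelihood ratio.
proba = qda.predict_proba(X)
lr = proba[:, 1] / np.clip(proba[:, 0], 1e-12, None)
print(lr[:5])
```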


2005 ◽  
Vol 1 ◽  
pp. 117693510500100 ◽  
Author(s):  
Sreelatha Meleth ◽  
Isam-Eldin Eltoum ◽  
Liu Zhu ◽  
Denise Oelschlager ◽  
Chandrika Piyathilake ◽  
...  

Background Most published literature using SELDI-TOF has used traditional spectral analysis techniques, such as Fourier transforms and wavelets, for denoising. Most of these publications also compare spectra using their most prominent feature, i.e., peaks or local maxima. Methods The maximum intensity value within each window of differentiable m/z values was used to represent the intensity level in that window. We also calculated the area under the curve (AUC) spanned by each window. Results Keeping everything else constant, such as the pre-processing of the data and the classifier used, the AUC performed much better as a metric of comparison than the peaks in two out of three data sets. In the third data set, both metrics performed equivalently. Conclusions This study shows that the feature used to compare spectra can have an impact on the results of a study attempting to identify biomarkers using SELDI-TOF data.
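
A minimal sketch of the two competing window features described in the Methods, assuming NumPy: for each fixed-width m/z window, take either the maximum intensity (the peak-style feature) or the area under the curve (trapezoidal integral). The spectrum here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
mz = np.linspace(2000, 20000, 5000)              # m/z axis
intensity = np.abs(rng.normal(0, 1, mz.size))    # stand-in spectrum

n_windows = 100
edges = np.linspace(mz[0], mz[-1] + 1, n_windows + 1)

max_features, auc_features = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    m = (mz >= lo) & (mz < hi)
    x, yv = mz[m], intensity[m]
    max_features.append(yv.max())                # peak-style feature
    # Trapezoidal area under the curve within the window.
    auc_features.append(np.sum((yv[:-1] + yv[1:]) / 2 * np.diff(x)))
```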


Author(s):  
Krishna Prasad K ◽  
Aithal P. S. ◽  
Navin N. Bappalige ◽  
Soumya S

Purpose: Predicting and then preventing cardiac arrest in an ICU patient is a most challenging task, even for highly skilled professionals. The data collected in the ICU for a patient are huge, and selecting the portion of those data that can help prevent cardiac arrest within a short window of time is highly decisive; analysing and predicting from such large data require an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict cardiovascular risks, and machine learning techniques are the right choice, in the advent of technology, for managing patients at risk of cardiac arrest. Methodology: In this work we collected and merged three data sets: the Cleveland data set of US patients, with 303 records; the Statlog data set of UK patients, with 270 records; and the Hungarian data set of patients from Hungary and Switzerland, with 617 records. The combination of all three is a comprehensive data set of 1190 records with 11 common features. Findings/Results: The feature extraction phase extracts 7 features that contribute to the event. The extracted features are used to train the selected machine learning classifier models, which are then evaluated on test data. The Extra Tree Classifier achieved the highest average area under the curve (AUC), at 0.957. Originality: The original contribution is the analysis of this combined data set with machine learning classifier models, with the Extra Tree Classifier achieving the highest average AUC of 0.957. Paper Type: Experimental Research. Keywords: Cardiac, Machine Learning, Random Forest, XGBoost, ROC AUC, ST Slope.
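
A hedged sketch, assuming scikit-learn, of the modelling step described above: training an Extra Trees classifier on tabular cardiac features and estimating the mean ROC AUC by cross-validation. The 1190 x 7 feature matrix is synthetic; the merged Cleveland/Statlog/Hungarian data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1190, 7))                   # 1190 records, 7 extracted features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 1190) > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean ROC AUC: {auc:.3f}")
```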


2021 ◽  
Vol 7 (2) ◽  
pp. 755-758
Author(s):  
Daniel Wulff ◽  
Mohamad Mehdi ◽  
Floris Ernst ◽  
Jannis Hagenah

Abstract Data augmentation is a common method to make deep learning feasible on limited data sets. However, classical image augmentation methods produce highly unrealistic images when applied to ultrasound data. Another approach is to use learning-based augmentation methods, e.g., based on variational autoencoders or generative adversarial networks. However, a large amount of data is necessary to train these models, which is typically not available in scenarios where data augmentation is needed. One solution to this problem could be to transfer augmentation models between different medical imaging data sets. In this work, we present a qualitative study of the cross-data-set generalization performance of different learning-based augmentation methods for ultrasound image data. We show that knowledge transfer is possible in ultrasound image augmentation and that the augmentation partially results in semantically meaningful transfers of structures, e.g., vessels, across domains.
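
A minimal sketch, in Keras, of the latent-space idea behind such learning-based augmentation: a plain autoencoder (standing in for the variational autoencoders and GANs studied above) is trained on one data set, and its encoder/decoder pair is then reused to jitter and reconstruct samples from a different data set. All arrays are random stand-ins for ultrasound images.

```python
import numpy as np
from tensorflow.keras import layers, Model

source = np.random.rand(256, 32 * 32).astype("float32")   # stand-in "source" images
target = np.random.rand(64, 32 * 32).astype("float32")    # stand-in "target" images

inp = layers.Input(shape=(32 * 32,))
z = layers.Dense(32, activation="relu")(inp)              # latent code
out = layers.Dense(32 * 32, activation="sigmoid")(z)
autoencoder = Model(inp, out)
encoder = Model(inp, z)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(source, source, epochs=5, batch_size=32, verbose=0)

# Cross-data-set transfer: encode target images, perturb the codes, decode.
dec_in = layers.Input(shape=(32,))
decoder = Model(dec_in, autoencoder.layers[-1](dec_in))   # reuse the trained output layer
codes = encoder.predict(target, verbose=0)
noise = np.random.normal(0, 0.1, codes.shape).astype("float32")
augmented = decoder.predict(codes + noise, verbose=0)
```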


2000 ◽  
Vol 6 (S2) ◽  
pp. 1052-1053
Author(s):  
P. G. Kotula ◽  
M. R. Keenan

As more x-ray energy dispersive spectroscopy (EDS) manufacturers begin to offer spectrum imaging (a complete x-ray spectrum from each pixel in an image), there is a clear need for robust and automated methods for quickly extracting the relevant information from the large spectrum image data sets. A typical spectrum image may consist of 100 x 100 pixels (10,000 spectra), each with 1000 channels, which (when stored at double precision) amounts to 80 Mbytes. It is clear that a large four-dimensional data set such as this cannot be viewed in its entirety, and the time needed to analyze individual spectra by hand is prohibitive. Conventional analysis of spectrum images by mapping energy windows is useful only as a first pass for finding the elements present, and only if they are present at sufficient concentrations. Additional problems with mapping include systematic overlaps with other x-ray peaks, changes in the background shape, and displaying the maps so that they faithfully portray the actual signal intensity.
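
A minimal sketch, assuming NumPy, of the conventional energy-window mapping this passage criticizes: summing counts within a chosen energy window for every pixel of a spectrum image. The 100 x 100 x 1000 counts array and the calibration are synthetic.

```python
import numpy as np

# Synthetic spectrum image: 100 x 100 pixels, 1000 energy channels of counts.
spectrum_image = np.random.poisson(1.0, size=(100, 100, 1000))
ev_per_channel = 10.0                                    # assumed calibration

def window_map(data, e_lo, e_hi):
    """Sum counts in the [e_lo, e_hi) eV window for each pixel."""
    c_lo, c_hi = int(e_lo / ev_per_channel), int(e_hi / ev_per_channel)
    return data[:, :, c_lo:c_hi].sum(axis=2)

# e.g., a window near the Fe Kα line (illustrative energies).
fe_map = window_map(spectrum_image, 6300, 6500)
```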


2021 ◽  
pp. 1-26
Author(s):  
Richard C. Gerum ◽  
Achim Schilling

Up to now, modern machine learning (ML) has been based on approximating big data sets with high-dimensional functions, taking advantage of huge computational resources. We show that biologically inspired neuron models such as the leaky-integrate-and-fire (LIF) neuron provide novel and efficient ways of information processing. They can be integrated in machine learning models and are a potential target to improve ML performance. Thus, we have derived simple update rules for LIF units to numerically integrate the differential equations. We apply a surrogate gradient approach to train the LIF units via backpropagation. We demonstrate that tuning the leak term of the LIF neurons can be used to run the neurons in different operating modes, such as simple signal integrators or coincidence detectors. Furthermore, we show that the constant surrogate gradient, in combination with tuning the leak term of the LIF units, can be used to achieve the learning dynamics of more complex surrogate gradients. To prove the validity of our method, we applied it to established image data sets (the Oxford 102 flower data set, MNIST), implemented various network architectures, used several input data encodings and demonstrated that the method is suitable to achieve state-of-the-art classification performance. We provide our method as well as further surrogate gradient methods to train spiking neural networks via backpropagation as an open-source KERAS package to make it available to the neuroscience and machine learning community. To increase the interpretability of the underlying effects and thus make a small step toward opening the black box of machine learning, we provide interactive illustrations, with the possibility of systematically monitoring the effects of parameter changes on the learning characteristics.
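
A minimal sketch of the LIF update rule summarized above: the membrane potential decays by a leak factor, integrates its input, and emits a spike (with reset) when it crosses threshold. The constants are illustrative, not taken from the paper or its KERAS package.

```python
import numpy as np

def lif_run(inputs, leak=0.9, threshold=1.0):
    """Simulate one LIF unit over a 1D input sequence; return the spike train."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x            # leaky integration of the input current
        s = float(v >= threshold)   # spike when the threshold is crossed
        v = v * (1.0 - s)           # reset the membrane potential after a spike
        spikes.append(s)
    return np.array(spikes)

# leak -> 1: the unit acts as a signal integrator;
# leak -> 0: it fires only on (near-)coincident inputs.
print(lif_run(np.full(10, 0.3), leak=0.95).sum())
```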


Author(s):  
Tatyana Biloborodova ◽  
Inna Skarga-Bandurova ◽  
Mark Koverga

A methodology for eliminating class imbalance in image data sets is presented. The proposed methodology includes the stages of image fragment extraction, fragment augmentation, feature extraction, and duplication of minority objects, and is based on reinforcement learning technology. The degree-of-imbalance indicator was used as the measure of data set imbalance. An experiment was performed using a set of facial images of patients with skin rashes, annotated according to acne severity. The main steps of the methodology implementation are considered. The classification results showed the feasibility of applying the proposed methodology: the accuracy of classification on test data was 85%, which is 5% higher than the result obtained without the proposed methodology. Keywords: class imbalance, unbalanced data set, image fragment extraction, augmentation.
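
A hedged sketch of two steps named above, assuming NumPy: computing a simple degree-of-imbalance indicator and duplicating minority-class samples until the classes are balanced. The acne-severity labels are synthetic, and this particular indicator is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

labels = np.array([0] * 120 + [1] * 40 + [2] * 15)    # hypothetical severity grades

counts = np.bincount(labels)
imbalance_degree = counts.max() / counts.min()        # one simple indicator
print(f"degree of imbalance: {imbalance_degree:.1f}")

# Duplicate minority samples (with replacement) up to the majority count.
rng = np.random.default_rng(0)
balanced_idx = np.concatenate([
    rng.choice(np.where(labels == c)[0], size=counts.max(), replace=True)
    for c in range(len(counts))
])
print(np.bincount(labels[balanced_idx]))              # now uniform across classes
```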

