Assessing Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists

Iproceedings ◽  
10.2196/35391 ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. e35391
Author(s):  
Ibukun Oloruntoba ◽  
Toan D Nguyen ◽  
Zongyuan Ge ◽  
Tine Vestergaard ◽  
Victoria Mar

Background Convolutional neural networks (CNNs) are a type of artificial intelligence that show promise as a diagnostic aid for skin cancer. However, the majority are trained on retrospective image data sets of varying quality and image capture standardization. Objective The aim of our study was to use CNN models with the same architecture, but different training image sets, and test the variability in their performance when classifying skin cancer images from different populations, acquired with different devices. Additionally, we wanted to assess the performance of the models against Danish teledermatologists when tested on images acquired from Denmark. Methods Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 nonstandardized images taken from the International Skin Imaging Collaboration, acquired with different image capture devices. CNN-S was trained on 235,268 standardized images, and CNN-S2 was trained on 25,331 standardized images (matched for number and classes of training images to CNN-NS). Both standardized training sets were provided by MoleMap and acquired with the same image capture device. A total of 495 Danish patients with 569 images of skin lesions, predominantly of Fitzpatrick skin types II and III, were used to test the performance of the models. Four teledermatologists independently diagnosed and assessed the images taken of the lesions. Primary outcome measures were sensitivity, specificity, and area under the curve of the receiver operating characteristic (AUROC). Results A total of 569 images were taken from 495 patients (n=280, 57% women; n=215, 43% men; mean age 55 years, SD 17) for this study. On these images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889; P<.001 vs CNN-NS), and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; P=.009 vs CNN-NS), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists, the models' resultant sensitivities and specificities were surpassed by the teledermatologists; however, compared with CNN-S, the differences were not statistically significant (P=.10; P=.05). Performance across all CNN models and teledermatologists was influenced by image quality. Conclusions CNNs trained on standardized images showed improved performance, and therefore greater generalizability, in skin cancer classification when applied to an unseen data set. This is an important consideration for future algorithm development, regulation, and approval. Further, when tested on these unseen images, the teledermatologists clinically outperformed all the CNN models; however, the difference compared with CNN-S was not statistically significant. Conflicts of Interest VM received speaker fees from Merck, Eli Lilly, Novartis, and Bristol Myers Squibb. VM is the principal investigator for a clinical trial funded by the Victorian Department of Health and Human Services with a 1:1 contribution from MoleMap.
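
A minimal sketch (not the authors' code) of the two evaluation steps this abstract describes, assuming scikit-learn: computing an AUROC for a binary malignant-versus-benign classifier, and matching the model's operating threshold to a target sensitivity (e.g., the teledermatologists' mean) to read off the corresponding specificity. The labels, scores, and target value are synthetic stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=569)                      # 1 = malignant, 0 = benign
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 569), 0, 1)

auroc = roc_auc_score(y_true, y_score)                     # area under the ROC curve

# Match the model's operating point to a target sensitivity, then
# read off the specificity it achieves at that threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
target_sensitivity = 0.85                                  # assumed value, for illustration
idx = np.argmin(np.abs(tpr - target_sensitivity))
print(f"AUROC={auroc:.3f}, sens={tpr[idx]:.2f}, spec={1 - fpr[idx]:.2f}")
```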

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Gurman Gill ◽  
Reinhard R. Beichel

Dynamic and longitudinal lung CT imaging produce 4D lung image data sets, enabling applications like radiation treatment planning or assessment of response to treatment of lung diseases. In this paper, we present a 4D lung segmentation method that mutually utilizes all individual CT volumes to derive segmentations for each CT data set. Our approach is based on a 3D robust active shape model and extends it to fully utilize 4D lung image data sets. This yields an initial segmentation for the 4D volume, which is then refined by using a 4D optimal surface finding algorithm. The approach was evaluated on a diverse set of 152 CT scans of normal and diseased lungs, consisting of total lung capacity and functional residual capacity scan pairs. In addition, a comparison to a 3D segmentation method and a registration-based 4D lung segmentation approach was performed. The proposed 4D method obtained an average Dice coefficient of 0.9773 ± 0.0254, which was statistically significantly better (p value ≪ 0.001) than the 3D method (0.9659 ± 0.0517). Compared to the registration-based 4D method, our method obtained better or similar performance but was 58.6% faster. Also, the method can easily be expanded to process 4D CT data sets consisting of several volumes.
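
The headline metric above is the Dice coefficient; a minimal sketch of how it is computed for binary segmentation masks (toy volumes, not the paper's lung data):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Example: two toy overlapping 3D volumes.
a = np.zeros((4, 4, 4), dtype=bool); a[1:3, 1:3, 1:3] = True
b = np.zeros((4, 4, 4), dtype=bool); b[1:4, 1:3, 1:3] = True
print(dice_coefficient(a, b))  # 2*8 / (8 + 12) = 0.8
```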


Separations ◽  
2018 ◽  
Vol 5 (3) ◽  
pp. 44 ◽  
Author(s):  
Alyssa Allen ◽  
Mary Williams ◽  
Nicholas Thurn ◽  
Michael Sigman

Computational models for determining the strength of fire debris evidence based on likelihood ratios (LR) were developed using in silico-generated data and validated against data sets derived from different distributions of ASTM E1618-14-designated ignitable liquid classes and substrate pyrolysis contributions. All of the models perform well in cross-validation against the distribution used to generate them. However, a model trained on data that lacks representatives of some ASTM E1618-14 classes performs poorly when validated against data sets that contain the missing classes. A quadratic discriminant model based on a balanced data set (ignitable liquid versus substrate pyrolysis), with a uniform distribution of the ASTM E1618-14 classes, performed well (receiver operating characteristic area under the curve of 0.836) when tested against laboratory-developed, casework-relevant samples of known ground truth.
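
A hedged sketch of the kind of model named above: a quadratic discriminant classifier over fire-debris features, with a likelihood ratio formed from the posterior probabilities under equal priors. The feature matrix and labels are synthetic; this is not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)
# Synthetic features: 1 = ignitable liquid, 0 = substrate pyrolysis.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(1, 1, (200, 5))])
y = np.repeat([0, 1], 200)

qda = QuadraticDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

# With equal priors, the ratio of posteriors equals the likelihood ratio.
proba = qda.predict_proba(X)
lr = proba[:, 1] / np.clip(proba[:, 0], 1e-12, None)
print(lr[:5])
```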


2005 ◽  
Vol 1 ◽  
pp. 117693510500100 ◽  
Author(s):  
Sreelatha Meleth ◽  
Isam-Eldin Eltoum ◽  
Liu Zhu ◽  
Denise Oelschlager ◽  
Chandrika Piyathilake ◽  
...  

Background Most published literature using SELDI-TOF has used traditional spectral analysis techniques, such as Fourier transforms and wavelets, for denoising. Most of these publications also compare spectra using their most prominent feature, i.e., peaks or local maxima. Methods The maximum intensity value within each window of differentiable m/z values was used to represent the intensity level in that window. We also calculated the area under the curve (AUC) spanned by each window. Results Keeping everything else constant, such as the pre-processing of the data and the classifier used, the AUC performed much better as a metric of comparison than the peaks in two out of three data sets. In the third data set, both metrics performed equivalently. Conclusions This study shows that the feature used to compare spectra can have an impact on the results of a study attempting to identify biomarkers using SELDI-TOF data.
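
A minimal sketch of the two competing window features described in the Methods, assuming NumPy: for each fixed-width m/z window, take either the maximum intensity (the peak-style feature) or the area under the curve (trapezoidal integral). The spectrum here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
mz = np.linspace(2000, 20000, 5000)              # m/z axis
intensity = np.abs(rng.normal(0, 1, mz.size))    # stand-in spectrum

n_windows = 100
edges = np.linspace(mz[0], mz[-1] + 1, n_windows + 1)

max_features, auc_features = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    m = (mz >= lo) & (mz < hi)
    x, yv = mz[m], intensity[m]
    max_features.append(yv.max())                # peak-style feature
    # Trapezoidal area under the curve within the window.
    auc_features.append(np.sum((yv[:-1] + yv[1:]) / 2 * np.diff(x)))
```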


Author(s):  
Krishna Prasad K ◽  
Aithal P. S. ◽  
Navin N. Bappalige ◽  
Soumya S

Purpose: Predicting and then preventing cardiac arrest in an ICU patient is a most challenging task, even for highly skilled professionals. The data collected in the ICU for a patient are huge, and selecting the portion of those data that can help prevent cardiac arrest within a short window of time is highly decisive; analysing and predicting from such large data require an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict cardiovascular risks, and machine learning techniques are the right choice, in the advent of technology, for managing patients at risk of cardiac arrest. Methodology: In this work we collected and merged three data sets: the Cleveland data set of US patients, with 303 records; the Statlog data set of UK patients, with 270 records; and the Hungarian data set of patients from Hungary and Switzerland, with 617 records. The combination of all three is a comprehensive data set of 1190 records with 11 common features. Findings/Results: The feature extraction phase extracts 7 features that contribute to the event. The extracted features are used to train the selected machine learning classifier models, which are then evaluated on test data. The Extra Tree Classifier achieved the highest average area under the curve (AUC), at 0.957. Originality: The original contribution is the analysis of this combined data set with machine learning classifier models, with the Extra Tree Classifier achieving the highest average AUC of 0.957. Paper Type: Experimental Research. Keywords: Cardiac, Machine Learning, Random Forest, XGBoost, ROC AUC, ST Slope.
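
A hedged sketch, assuming scikit-learn, of the modelling step described above: training an Extra Trees classifier on tabular cardiac features and estimating the mean ROC AUC by cross-validation. The 1190 x 7 feature matrix is synthetic; the merged Cleveland/Statlog/Hungarian data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1190, 7))                   # 1190 records, 7 extracted features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 1190) > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean ROC AUC: {auc:.3f}")
```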


2021 ◽  
Vol 7 (2) ◽  
pp. 755-758
Author(s):  
Daniel Wulff ◽  
Mohamad Mehdi ◽  
Floris Ernst ◽  
Jannis Hagenah

Abstract Data augmentation is a common method to make deep learning feasible on limited data sets. However, classical image augmentation methods produce highly unrealistic images when applied to ultrasound data. Another approach is to use learning-based augmentation methods, e.g., based on variational autoencoders or generative adversarial networks. However, a large amount of data is necessary to train these models, which is typically not available in scenarios where data augmentation is needed. One solution to this problem could be to transfer augmentation models between different medical imaging data sets. In this work, we present a qualitative study of the cross-data-set generalization performance of different learning-based augmentation methods for ultrasound image data. We show that knowledge transfer is possible in ultrasound image augmentation and that the augmentation partially results in semantically meaningful transfers of structures, e.g., vessels, across domains.
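
A minimal sketch, in Keras, of the latent-space idea behind such learning-based augmentation: a plain autoencoder (standing in for the variational autoencoders and GANs studied above) is trained on one data set, and its encoder/decoder pair is then reused to jitter and reconstruct samples from a different data set. All arrays are random stand-ins for ultrasound images.

```python
import numpy as np
from tensorflow.keras import layers, Model

source = np.random.rand(256, 32 * 32).astype("float32")   # stand-in "source" images
target = np.random.rand(64, 32 * 32).astype("float32")    # stand-in "target" images

inp = layers.Input(shape=(32 * 32,))
z = layers.Dense(32, activation="relu")(inp)              # latent code
out = layers.Dense(32 * 32, activation="sigmoid")(z)
autoencoder = Model(inp, out)
encoder = Model(inp, z)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(source, source, epochs=5, batch_size=32, verbose=0)

# Cross-data-set transfer: encode target images, perturb the codes, decode.
dec_in = layers.Input(shape=(32,))
decoder = Model(dec_in, autoencoder.layers[-1](dec_in))   # reuse the trained output layer
codes = encoder.predict(target, verbose=0)
noise = np.random.normal(0, 0.1, codes.shape).astype("float32")
augmented = decoder.predict(codes + noise, verbose=0)
```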


2000 ◽  
Vol 6 (S2) ◽  
pp. 1052-1053
Author(s):  
P. G. Kotula ◽  
M. R. Keenan

As more x-ray energy dispersive spectroscopy (EDS) manufacturers begin to offer spectrum imaging (a complete x-ray spectrum from each pixel in an image), there is a clear need for robust and automated methods for quickly extracting the relevant information from the large spectrum image data sets. A typical spectrum image may consist of 100 x 100 pixels (10,000 spectra), each with 1000 channels, which (when stored at double precision) amounts to 80 Mbytes. It is clear that a large four-dimensional data set such as this cannot be viewed in its entirety, and the time needed to analyze individual spectra by hand is prohibitive. Conventional analysis of spectrum images by mapping energy windows is useful only as a first pass for finding the elements present, and only if they are present at sufficient concentrations. Additional problems with mapping include systematic overlaps with other x-ray peaks, changes in the background shape, and displaying the maps so that they faithfully portray the actual signal intensity.
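
A minimal sketch, assuming NumPy, of the conventional energy-window mapping this passage criticizes: summing counts within a chosen energy window for every pixel of a spectrum image. The 100 x 100 x 1000 counts array and the calibration are synthetic.

```python
import numpy as np

# Synthetic spectrum image: 100 x 100 pixels, 1000 energy channels of counts.
spectrum_image = np.random.poisson(1.0, size=(100, 100, 1000))
ev_per_channel = 10.0                                    # assumed calibration

def window_map(data, e_lo, e_hi):
    """Sum counts in the [e_lo, e_hi) eV window for each pixel."""
    c_lo, c_hi = int(e_lo / ev_per_channel), int(e_hi / ev_per_channel)
    return data[:, :, c_lo:c_hi].sum(axis=2)

# e.g., a window near the Fe Kα line (illustrative energies).
fe_map = window_map(spectrum_image, 6300, 6500)
```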


2021 ◽  
pp. 1-26
Author(s):  
Richard C. Gerum ◽  
Achim Schilling

Up to now, modern machine learning (ML) has been based on approximating big data sets with high-dimensional functions, taking advantage of huge computational resources. We show that biologically inspired neuron models such as the leaky-integrate-and-fire (LIF) neuron provide novel and efficient ways of information processing. They can be integrated in machine learning models and are a potential target to improve ML performance. Thus, we have derived simple update rules for LIF units to numerically integrate the differential equations. We apply a surrogate gradient approach to train the LIF units via backpropagation. We demonstrate that tuning the leak term of the LIF neurons can be used to run the neurons in different operating modes, such as simple signal integrators or coincidence detectors. Furthermore, we show that the constant surrogate gradient, in combination with tuning the leak term of the LIF units, can be used to achieve the learning dynamics of more complex surrogate gradients. To prove the validity of our method, we applied it to established image data sets (the Oxford 102 flower data set, MNIST), implemented various network architectures, used several input data encodings and demonstrated that the method is suitable to achieve state-of-the-art classification performance. We provide our method as well as further surrogate gradient methods to train spiking neural networks via backpropagation as an open-source KERAS package to make it available to the neuroscience and machine learning community. To increase the interpretability of the underlying effects and thus make a small step toward opening the black box of machine learning, we provide interactive illustrations, with the possibility of systematically monitoring the effects of parameter changes on the learning characteristics.
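
A minimal sketch of the LIF update rule summarized above: the membrane potential decays by a leak factor, integrates its input, and emits a spike (with reset) when it crosses threshold. The constants are illustrative, not taken from the paper or its KERAS package.

```python
import numpy as np

def lif_run(inputs, leak=0.9, threshold=1.0):
    """Simulate one LIF unit over a 1D input sequence; return the spike train."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x            # leaky integration of the input current
        s = float(v >= threshold)   # spike when the threshold is crossed
        v = v * (1.0 - s)           # reset the membrane potential after a spike
        spikes.append(s)
    return np.array(spikes)

# leak -> 1: the unit acts as a signal integrator;
# leak -> 0: it fires only on (near-)coincident inputs.
print(lif_run(np.full(10, 0.3), leak=0.95).sum())
```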


Author(s):  
Tatyana Biloborodova ◽  
Inna Skarga-Bandurova ◽  
Mark Koverga

A methodology for eliminating class imbalance in image data sets is presented. The proposed methodology includes the stages of image fragment extraction, fragment augmentation, feature extraction, and duplication of minority objects, and is based on reinforcement learning technology. The degree-of-imbalance indicator was used as the measure of data set imbalance. An experiment was performed using a set of facial images of patients with skin rashes, annotated according to acne severity. The main steps of the methodology implementation are considered. The classification results showed the feasibility of applying the proposed methodology: the accuracy of classification on test data was 85%, which is 5% higher than the result obtained without the proposed methodology. Keywords: class imbalance, unbalanced data set, image fragment extraction, augmentation.
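
A hedged sketch of two steps named above, assuming NumPy: computing a simple degree-of-imbalance indicator and duplicating minority-class samples until the classes are balanced. The acne-severity labels are synthetic, and this particular indicator is an illustrative choice, not necessarily the paper's.

```python
import numpy as np

labels = np.array([0] * 120 + [1] * 40 + [2] * 15)    # hypothetical severity grades

counts = np.bincount(labels)
imbalance_degree = counts.max() / counts.min()        # one simple indicator
print(f"degree of imbalance: {imbalance_degree:.1f}")

# Duplicate minority samples (with replacement) up to the majority count.
rng = np.random.default_rng(0)
balanced_idx = np.concatenate([
    rng.choice(np.where(labels == c)[0], size=counts.max(), replace=True)
    for c in range(len(counts))
])
print(np.bincount(labels[balanced_idx]))              # now uniform across classes
```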

