Evaluation of artificial intelligence on a reference standard based on subjective interpretation

Author(s):  
Po-Hsuan Cameron Chen ◽  
Craig H Mermel ◽  
Yun Liu
PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e8854
Author(s):  
Fengdan Wang ◽  
Xiao Gu ◽  
Shi Chen ◽  
Yongliang Liu ◽  
Qing Shen ◽  
...  

Objective: Bone age (BA) is a crucial indicator of the growth and development of children. This study tested the performance of a fully automated artificial intelligence (AI) system for BA assessment of Chinese children with abnormal growth and development.

Materials and Methods: A fully automated AI system based on the Greulich and Pyle (GP) method was developed for Chinese children using 8,000 BA radiographs from five medical centers nationwide in China. A total of 745 cases (360 boys and 385 girls) with abnormal growth and development were then consecutively collected from another tertiary medical center in north China between January and October 2018 to test the system. The reference standard was defined as the result interpreted through consensus by two experienced reviewers (a radiologist with 10 years and an endocrinologist with 15 years of experience in BA reading) using the GP atlas. BA accuracy within 1 year, root mean square error (RMSE), mean absolute difference (MAD), and 95% limits of agreement according to the Bland-Altman plot were calculated.

Results: For Chinese pediatric patients with abnormal growth and development, the accuracy of this new automated AI system within 1 year was 84.60% compared to the reference standard, with the highest percentage, 89.45%, in the 12- to 18-year group. The RMSE, MAD, and 95% limits of agreement of the AI system were 0.76 years, 0.58 years, and −1.547 to 1.428 years, respectively, according to the Bland-Altman plot. The largest differences between the AI and experts' BA results were noted for patients of short stature with bone deformities, severe osteomalacia, or different rates of maturation of the carpals and phalanges.

Conclusions: The developed automated AI system achieved BA results comparable to those of experienced reviewers for Chinese children with abnormal growth and development.
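The agreement metrics reported above (RMSE, MAD, and Bland-Altman 95% limits of agreement) can be computed directly from paired AI and expert readings. The sketch below uses hypothetical bone-age values, not the study's data; the 95% limits of agreement are the mean difference ± 1.96 standard deviations of the differences.

```python
import math

def agreement_metrics(ai, ref):
    """RMSE, MAD, and Bland-Altman 95% limits of agreement for paired readings."""
    diffs = [a - r for a, r in zip(ai, ref)]
    n = len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mad = sum(abs(d) for d in diffs) / n
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))  # sample SD
    return rmse, mad, (mean - 1.96 * sd, mean + 1.96 * sd)

# Hypothetical paired bone-age readings in years (AI vs. expert consensus)
ai_years = [10.0, 12.5, 8.0, 14.0, 11.0]
ref_years = [10.5, 12.0, 8.5, 13.5, 11.5]
rmse, mad, (loa_lo, loa_hi) = agreement_metrics(ai_years, ref_years)
```

With real data, narrow limits of agreement centered near zero indicate that AI readings are clinically interchangeable with the expert consensus.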


2020 ◽  
pp. bjophthalmol-2020-316594 ◽  
Author(s):  
Peter Heydon ◽  
Catherine Egan ◽  
Louis Bolter ◽  
Ryan Chambers ◽  
John Anderson ◽  
...  

Background/aims: Human grading of digital images from diabetic retinopathy (DR) screening programmes represents a significant challenge due to the increasing prevalence of diabetes. We evaluate the performance of an automated artificial intelligence (AI) algorithm to triage retinal images from the English Diabetic Eye Screening Programme (DESP) into test-positive/technical failure versus test-negative, using human grading following a standard national protocol as the reference standard.

Methods: Retinal images from 30 405 consecutive screening episodes from three English DESPs were manually graded following a standard national protocol and by an automated process with machine-learning-enabled software, EyeArt v2.1. Screening performance (sensitivity, specificity) and diagnostic accuracy (95% CIs) were determined using human grades as the reference standard.

Results: Sensitivity (95% CI) of EyeArt was 95.7% (94.8% to 96.5%) for referable retinopathy (human-graded ungradable, referable maculopathy, moderate-to-severe non-proliferative or proliferative). This comprises sensitivities of 98.3% (97.3% to 98.9%) for mild-to-moderate non-proliferative retinopathy with referable maculopathy, 100% (98.7% to 100%) for moderate-to-severe non-proliferative retinopathy and 100% (97.9% to 100%) for proliferative disease. EyeArt agreed with the human grade of no retinopathy (specificity) in 68% (67% to 69%) of cases, with a specificity of 54.0% (53.4% to 54.5%) when combined with non-referable retinopathy.

Conclusion: The algorithm demonstrated safe levels of sensitivity for high-risk retinopathy in a real-world screening service, with specificity that could halve the workload for human graders. AI machine learning and deep learning algorithms such as this can provide clinically equivalent, rapid detection of retinopathy, particularly in settings where a trained workforce is unavailable or where large-scale and rapid results are needed.
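Screening performance of the kind reported above reduces to sensitivity and specificity with binomial confidence intervals computed from a confusion matrix. A minimal sketch, using a Wilson score interval and hypothetical counts (not the study's data; the published CIs may use a different interval method):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def screening_performance(tp, fn, tn, fp):
    """Sensitivity and specificity, each with a Wilson 95% CI."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens, wilson_ci(tp, tp + fn)), (spec, wilson_ci(tn, tn + fp))

# Hypothetical confusion-matrix counts for an image triage algorithm
tp, fn, tn, fp = 95, 5, 680, 320
(sens, sens_ci), (spec, spec_ci) = screening_performance(tp, fn, tn, fp)
```

For a screening triage tool, sensitivity for referable disease is the safety-critical figure; specificity determines how much human grading workload the algorithm can remove.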


Diagnostics ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1608
Author(s):  
Anne Schlickenrieder ◽  
Ole Meyer ◽  
Jule Schönewolf ◽  
Paula Engels ◽  
Reinhard Hickel ◽  
...  

The aim of the present study was to investigate the diagnostic performance of a trained convolutional neural network (CNN) for detecting and categorizing fissure sealants from intraoral photographs, using the expert standard as reference. An image set consisting of 2352 digital photographs of permanent posterior teeth (461 unsealed tooth surfaces/1891 sealed surfaces) was divided into a training set (n = 1881/364/1517) and a test set (n = 471/97/374). All the images were scored according to the following categories: unsealed molar, intact, sufficient and insufficient sealant. Expert diagnoses served as the reference standard for cyclic training and repeated evaluation of the CNN (ResNeXt-101-32x8d), which was trained using image augmentation and transfer learning. A statistical analysis was performed, including the calculation of contingency tables and areas under the receiver operating characteristic curve (AUC). The results showed that the CNN accurately detected sealants in 98.7% of all the test images, corresponding to an AUC of 0.996. The diagnostic accuracy and AUC were 89.6% and 0.951, respectively, for intact sealant; 83.2% and 0.888, respectively, for sufficient sealant; and 92.4% and 0.942, respectively, for insufficient sealant. On the basis of the documented results, it was concluded that good agreement with the reference standard could be achieved for automated sealant detection using artificial intelligence methods. Nevertheless, further research is necessary to improve the model performance.
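The AUC figures above have a useful probabilistic reading: the AUC equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A minimal sketch of that rank-based computation (the Mann-Whitney formulation), with hypothetical scores rather than the study's data:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical CNN scores for sealed (positive) and unsealed (negative) surfaces
pos = [0.91, 0.78, 0.65]
neg = [0.20, 0.33, 0.78]
score = auc(pos, neg)
```

This O(P·N) pairwise form is fine for illustration; production code would sort the scores once and use ranks for large test sets.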


Author(s):  
David L. Poole ◽  
Alan K. Mackworth
