Differential Biases and Variabilities of Deep Learning–Based Artificial Intelligence and Human Experts in Clinical Diagnosis: Retrospective Cohort and Survey Study (Preprint)

2021 ◽  
Author(s):  
Dongchul Cha ◽  
Chongwon Pae ◽  
Se A Lee ◽  
Gina Na ◽  
Young Kyun Hur ◽  
...  

BACKGROUND Deep learning (DL)–based artificial intelligence may have different diagnostic characteristics than human experts in medical diagnosis. Because DL is a data-driven knowledge system, heterogeneous population incidence in the clinical world is considered to bias DL more than it biases clinicians. Conversely, because each human expert experiences only a limited number of cases, human experts may exhibit large interindividual variability. Thus, understanding how the 2 groups classify given data differently is an essential step for the cooperative use of DL in clinical application. OBJECTIVE This study aimed to evaluate and compare the differential effects of clinical experience in otoendoscopic image diagnosis in both computers and physicians, as exemplified by the class imbalance problem, and to guide clinicians when utilizing decision support systems. METHODS We used digital otoendoscopic images of patients who visited the outpatient clinic in the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019, for a total of 22,707 otoendoscopic images. We excluded similar images, and 7500 otoendoscopic images were selected for labeling. We built a DL-based image classification model to classify the given image into 6 disease categories. Two test sets of 300 images were populated: balanced and imbalanced test sets. We included 14 clinicians (otolaryngologists and nonotolaryngology specialists including general practitioners) and 13 DL-based models. We used accuracy (overall and per-class) and kappa statistics to compare the results of individual physicians and the ML models. RESULTS Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), equivalent to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). However, ML models suffered from class imbalance problems (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%). This was mitigated by data augmentation, particularly for low incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, despite being less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07). CONCLUSIONS Even though ML models deliver excellent performance in classifying ear disease, physicians and ML models have their own strengths. ML models have consistent and high accuracy while considering only the given image and show bias toward prevalence, whereas human physicians have varying performance but do not show bias toward prevalence and may also consider extra information beyond the images. To deliver the best patient care amid the shortage of otolaryngologists, our ML model can serve a cooperative role for clinicians with diverse expertise, as long as it is kept in mind that models consider only images and could be biased toward prevalent diseases even after data augmentation.
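The abstract above compares raters using overall accuracy, per-class accuracy, and kappa statistics. The following is a minimal sketch of those two measures, not the authors' code. It assumes Cohen's kappa for a pair of raters (the abstract does not say which kappa variant was used), and the function names and the tiny two-class example are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified items within each true class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in totals}


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items.

    Assumption: Cohen's kappa; the abstract only says "kappa statistics".
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty item list")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical imbalanced test set: a rater that always answers "normal"
# reaches 80% overall accuracy but 0% accuracy on the rare class.
y_true = ["normal"] * 8 + ["rare"] * 2
y_pred = ["normal"] * 10
by_class = per_class_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy looks acceptable while the rare class is never detected, which is the kind of prevalence bias that per-class accuracy is meant to expose.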

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the measured performance of the model varies depending on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate train/test sets using the R-value-based sampling method. We then evaluated how similar the distribution of each candidate was to that of the whole dataset, and the candidate with the smallest distribution difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more suitable training and test sets than previous sampling methods, including random and non-random sampling.
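The selection step described above (generate many candidate splits, keep the one whose distribution is closest to the whole dataset) can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are drawn by plain random shuffling rather than the R-value-based sampling the abstract names, only a single numeric feature is compared, and the histogram distance (sum of absolute bin differences) is a hypothetical choice.

```python
import random


def histogram(values, lo, hi, bins):
    """Relative-frequency histogram of values over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def histogram_distance(a, b):
    """Sum of absolute bin differences (0 means identical histograms)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_split(values, test_fraction=0.25, candidates=200, bins=5, seed=0):
    """Return (distance, test_indices) for the candidate test set whose
    histogram is closest to the histogram of the whole dataset.

    Assumption: candidates come from random shuffles, a stand-in for the
    R-value-based sampling used in the paper.
    """
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    whole = histogram(values, lo, hi, bins)
    n_test = max(1, round(len(values) * test_fraction))
    indices = list(range(len(values)))
    best = None
    for _ in range(candidates):
        rng.shuffle(indices)
        test_idx = sorted(indices[:n_test])
        test_hist = histogram([values[i] for i in test_idx], lo, hi, bins)
        dist = histogram_distance(whole, test_hist)
        if best is None or dist < best[0]:
            best = (dist, test_idx)
    return best


distance, test_indices = best_split([float(v) for v in range(20)])
```

The remaining indices form the training set; a fuller version would compare every feature and, as the abstract mentions, feature importance as well.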


Author(s):  
Ying-Chun Pan ◽  
Hsun-Liang Chan ◽  
Xiangbo Kong ◽  
Lubomir M. Hadjiiski ◽  
Oliver D. Kripfgans

Objectives: Ultrasound is emerging as a complement to cone-beam computed tomography in dentistry, but struggles with artifacts like reverberation and shadowing. This study seeks to help novice users recognize soft tissue, bone, and crown on a dental sonogram, and to automate soft tissue height (STH) measurement using deep learning. Methods: In this retrospective study, 627 frames from 111 independent cine loops of mandibular and maxillary premolars and incisors collected from our porcine model (N = 8) were labeled by a reader. 274 premolar sonograms, including data augmentation, were used to train a multiclass segmentation model. The model was evaluated against several test sets, including premolars of the same breed (n = 74, Yucatan) and premolars of a different breed (n = 120, Sinclair). We further proposed a rule-based algorithm to automate STH measurements using predicted segmentation masks. Results: The model reached a Dice similarity coefficient of 90.7±4.39%, 89.4±4.63%, and 83.7±10.5% for soft tissue, bone, and crown segmentation, respectively, on the first test set (n = 74), and 90.0±7.16%, 78.6±13.2%, and 62.6±17.7% on the second test set (n = 120). The automated STH measurements have a mean difference (95% confidence interval) of −0.22 mm (−1.4, 0.95), a limit of agreement of 1.2 mm, and a minimum ICC of 0.915 (0.857, 0.948) when compared to expert annotation. Conclusion: This work demonstrates the potential use of deep learning in identifying periodontal structures on sonograms and obtaining diagnostic periodontal dimensions.
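The Dice similarity coefficient reported above is defined as twice the overlap between the predicted and reference masks divided by the sum of their sizes. A minimal sketch for one class, assuming binary masks stored as nested lists (the function name and the 2x2 example are hypothetical):

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |pred AND true| / (|pred| + |true|) for binary masks."""
    pred = [p for row in pred_mask for p in row]
    true = [t for row in true_mask for t in row]
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in true if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2 * intersection / total


predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = dice_coefficient(predicted, reference)  # 2 * 1 / (2 + 1)
```

For a multiclass segmentation model, the same function would be applied once per class (soft tissue, bone, crown) to a mask that marks only that class.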


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Zhehao He ◽  
Wang Lv ◽  
Jian Hu

Background. The differential diagnosis of subcentimetre lung nodules (diameter less than 1 cm) has always been a problem for imaging doctors and thoracic surgeons. We aimed to create a deep learning model for the diagnosis of pulmonary nodules using a simple method. Methods. Image data and pathological diagnoses of patients came from the First Affiliated Hospital of Zhejiang University School of Medicine from October 1, 2016, to October 1, 2019. After data preprocessing and data augmentation, the training set was used to train the model, and the test set was used to evaluate the trained model. Clinicians also diagnosed the test set. Results. A total of 2,295 images of 496 lung nodules and their corresponding pathological diagnoses were selected as the training set and test set. After data augmentation, the training set reached 12,510 images, including 6,648 malignant nodular images and 5,862 benign nodular images. The area under the P-R curve of the trained model was 0.836 in the classification of malignant and benign nodules. The area under the ROC curve of the trained model was 0.896 (95% CI: 78.96%~100.18%), which was higher than that of three doctors. However, the P value was not less than 0.05. Conclusion. With the help of an automatic machine learning system, clinicians can create a deep learning pulmonary nodule pathology classification model without the help of deep learning experts. The diagnostic efficiency of this model is not inferior to that of the clinician.
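The area under the ROC curve quoted above can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. A minimal sketch of that pairwise definition follows; the scores and labels are hypothetical, and this is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = malignant, 0 = benign.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a small test set; rank-based formulas give the same value more efficiently.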


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 8536-8536
Author(s):  
Gouji Toyokawa ◽  
Fahdi Kanavati ◽  
Seiya Momosaki ◽  
Kengo Tateishi ◽  
Hiroaki Takeoka ◽  
...  

Background: Lung cancer is the leading cause of cancer-related death in many countries, and its prognosis remains unsatisfactory. Since treatment approaches differ substantially based on the subtype, such as adenocarcinoma (ADC), squamous cell carcinoma (SCC) and small cell lung cancer (SCLC), an accurate histopathological diagnosis is of great importance. However, if the specimen is solely composed of poorly differentiated cancer cells, distinguishing between histological subtypes can be difficult. The present study developed a deep learning model to classify lung cancer subtypes from whole slide images (WSIs) of transbronchial lung biopsy (TBLB) specimens, in particular with the aim of using this model to evaluate a challenging test set of indeterminate cases. Methods: Our deep learning model consisted of two separately trained components: a convolutional neural network tile classifier and a recurrent neural network tile aggregator for the WSI diagnosis. We used a training set consisting of 638 WSIs of TBLB specimens to train a deep learning model to classify lung cancer subtypes (ADC, SCC and SCLC) and non-neoplastic lesions. The training set consisted of 593 WSIs for which the diagnosis had been determined by pathologists based on the visual inspection of Hematoxylin-Eosin (HE) slides and of 45 WSIs of indeterminate cases (64 ADCs and 19 SCCs). We then evaluated the models using five independent test sets. For each test set, we computed the area under the receiver operating characteristic (ROC) curve (AUC). Results: We applied the model to an indeterminate test set of WSIs obtained from TBLB specimens that pathologists had not been able to conclusively diagnose by examining the HE-stained specimens alone. Overall, the model achieved ROC AUCs of 0.993 (confidence interval [CI] 0.971-1.0) and 0.996 (0.981-1.0) for ADC and SCC, respectively. 
We further evaluated the model using five independent test sets consisting of both TBLB and surgically resected lung specimens (combined total of 2490 WSIs) and obtained highly promising results with ROC AUCs ranging from 0.94 to 0.99. Conclusions: In this study, we demonstrated that a deep learning model could be trained to predict lung cancer subtypes in indeterminate TBLB specimens. The extremely promising results obtained show that if deployed in clinical practice, a deep learning model that is capable of aiding pathologists in diagnosing indeterminate cases would be extremely beneficial as it would allow a diagnosis to be obtained sooner and reduce costs that would result from further investigations.


Author(s):  
Kalva Sindhu Priya

Abstract: Almost every field is now moving toward machine-based automation, from fundamental to master-level systems. Machine Learning (ML), which is closely related to Artificial Intelligence (AI), is one of the important tools in this shift: it uses known data or past experience to improve automatically, or to estimate the behavior or status of given data, through various algorithms. Modeling a system or data through ML is important and advantageous because it helps in the development of later and newer versions. Today, most information technology giants, such as Facebook, Uber, and Google Maps, have made ML a critical part of their ongoing operations to serve users better. This paper briefly describes the available ML algorithms and, from among them, uses the Linear Regression algorithm to predict a new set of values by taking older data as a reference. A detailed prediction model is then discussed by building code with the Machine Learning and Deep Learning tools in MATLAB/SIMULINK. Keywords: Machine Learning (ML), Linear Regression algorithm, Curve fitting, Root Mean Squared Error
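The abstract above predicts new values with linear regression and lists root mean squared error among its keywords. The paper's own code was built in MATLAB/SIMULINK; the following is only an illustrative Python sketch of an ordinary least-squares line fit and the RMSE calculation, with hypothetical function names and data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined when all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(ys, predictions):
    """Root mean squared error between observed and predicted values."""
    squared = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return math.sqrt(squared / len(ys))


# Hypothetical "older data" lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]
error = rmse(ys, fitted)
next_value = slope * 5.0 + intercept  # prediction for a new x
```

With noisy data the RMSE would be greater than zero and would summarize the typical size of the prediction error in the units of y.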


Author(s):  
Uzma Batool ◽  
Mohd Ibrahim Shapiai ◽  
Nordinah Ismail ◽  
Hilman Fauzi ◽  
Syahrizal Salleh

Silicon wafer defect data collected from fabrication facilities is intrinsically imbalanced because of the variable frequencies of defect types. Frequently occurring types will have more influence on the classification predictions if a model is trained on such skewed data. A fair classifier for such imbalanced data requires a mechanism to deal with type imbalance in order to avoid biased results. This study proposes a convolutional neural network for wafer map defect classification, employing oversampling as a technique for addressing imbalance. To have an equal participation of all classes in the classifier’s training, data augmentation has been employed, generating more samples in minor classes. The proposed deep learning method has been evaluated on a real wafer map defect dataset, and its classification results on the test set returned a 97.91% accuracy. The results were compared with those of another deep learning-based auto-encoder model, demonstrating that the proposed method is a potential approach for silicon wafer defect classification whose robustness needs further investigation.
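The oversampling idea above (give every class equal participation in training) can be sketched as follows. This is a simplified stand-in: it balances classes by randomly duplicating existing minority samples, whereas the study generates new samples through data augmentation. The function name and example labels are hypothetical.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Balance classes by resampling minority classes with replacement
    until each class matches the size of the largest class.

    Assumption: plain duplication replaces the augmentation used in the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical skewed dataset: 8 "center" defects and 2 "scratch" defects.
samples = list(range(10))
labels = ["center"] * 8 + ["scratch"] * 2
balanced_samples, balanced_labels = oversample(samples, labels)
```

Oversampling should be applied to the training set only, so that duplicated or augmented copies of a sample never leak into the test set.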


Diagnostics ◽  
2020 ◽  
Vol 10 (5) ◽  
pp. 261
Author(s):  
Tae-Young Heo ◽  
Kyoung Min Kim ◽  
Hyun Kyu Min ◽  
Sun Mi Gu ◽  
Jae Hyun Kim ◽  
...  

The use of deep-learning-based artificial intelligence (AI) is emerging in ophthalmology, with AI-mediated differential diagnosis of neovascular age-related macular degeneration (AMD) and dry AMD a promising methodology for precise treatment strategies and prognosis. Here, we developed deep learning algorithms and predicted diseases using 399 images of fundus. Based on feature extraction and classification with fully connected layers, we applied the Visual Geometry Group with 16 layers (VGG16) model of convolutional neural networks to classify new images. Image-data augmentation in our model was performed using Keras ImageDataGenerator, and the leave-one-out procedure was used for model cross-validation. The prediction and validation results obtained using the AI AMD diagnosis model showed relevant performance and suitability as well as better diagnostic accuracy than manual review by first-year residents. These results suggest the efficacy of this tool for early differential diagnosis of AMD in situations involving shortages of ophthalmology specialists and other medical devices.
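The leave-one-out procedure mentioned above trains on all samples but one and tests on the held-out sample, repeating this once per sample. A minimal sketch follows, with a hypothetical one-dimensional nearest-neighbour classifier standing in for the VGG16 model, which is far too large to reproduce here.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, held_out_index) for each of n samples."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


def loo_accuracy(features, labels, fit):
    """Leave-one-out accuracy; fit(train_x, train_y) returns a predictor."""
    hits = 0
    for train_idx, test_idx in leave_one_out_splits(len(features)):
        predict = fit([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits += predict(features[test_idx]) == labels[test_idx]
    return hits / len(features)


def fit_nearest_neighbour(train_x, train_y):
    """Hypothetical stand-in model: label of the closest training value."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


features = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
labels = [0, 0, 0, 1, 1, 1]
accuracy = loo_accuracy(features, labels, fit_nearest_neighbour)
```

With a dataset of a few hundred images, this means one training run per image, which is costly but uses nearly all data for training in every round.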


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yong Liang ◽  
Qi Cui ◽  
Xing Luo ◽  
Zhisong Xie

Rock classification is a significant branch of geology which can help understand the formation and evolution of the planet, search for mineral resources, and so on. In traditional methods, rock classification is usually done based on the experience of a professional. However, this method has problems such as low efficiency and susceptibility to subjective factors. Therefore, it is of great significance to establish a simple, fast, and accurate rock classification model. This paper proposes a fine-grained image classification network combining image cutting method and SBV algorithm to improve the classification performance of a small number of fine-grained rock samples. The method uses image cutting to achieve data augmentation without adding additional datasets and uses image block voting scoring to obtain richer complementary information, thereby improving the accuracy of image classification. The classification accuracy of 32 images is 75%, 68.75%, and 75%. The results show that the method proposed in this paper has a significant improvement in the accuracy of image classification, which is 34.375%, 18.75%, and 43.75% higher than that of the original algorithm. It verifies the effectiveness of the algorithm in this paper and at the same time proves that deep learning has great application value in the field of geology.
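The two ideas in the abstract above, cutting each image into blocks and combining block-level predictions by voting, can be sketched as follows. The abstract does not describe the scoring rule of the SBV algorithm, so plain majority voting is used here as an assumption, and the block classifier is a hypothetical placeholder for the trained network.

```python
from collections import Counter


def cut_into_blocks(image, block):
    """Split a 2D image (list of rows) into non-overlapping square blocks."""
    rows, cols = len(image), len(image[0])
    return [[row[c:c + block] for row in image[r:r + block]]
            for r in range(0, rows - block + 1, block)
            for c in range(0, cols - block + 1, block)]


def classify_by_voting(image, block, classify_block):
    """Classify every block, then return the most frequent label.

    Assumption: simple majority vote, not the paper's SBV scoring.
    """
    votes = Counter(classify_block(b) for b in cut_into_blocks(image, block))
    return votes.most_common(1)[0][0]


# Hypothetical 4x4 "image" and a toy block classifier based on brightness.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
blocks = cut_into_blocks(image, 2)


def toy_classifier(block):
    return "dark" if sum(map(sum, block)) < 45 else "light"


label = classify_by_voting(image, 2, toy_classifier)
```

Cutting also multiplies the number of training samples without adding new images, which is how the abstract describes its data augmentation.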


2021 ◽  
pp. bjophthalmol-2020-316290
Author(s):  
Bing Li ◽  
Huan Chen ◽  
Bilei Zhang ◽  
Mingzhen Yuan ◽  
Xuemin Jin ◽  
...  

Aim: To explore and evaluate an appropriate deep learning system (DLS) for the detection of 12 major fundus diseases using colour fundus photography. Methods: Diagnostic performance of a DLS was tested on the detection of normal fundus and 12 major fundus diseases including referable diabetic retinopathy, pathologic myopic retinal degeneration, retinal vein occlusion, retinitis pigmentosa, retinal detachment, wet and dry age-related macular degeneration, epiretinal membrane, macula hole, possible glaucomatous optic neuropathy, papilledema and optic nerve atrophy. The DLS was developed with 56 738 images and tested with 8176 images from one internal test set and two external test sets. The comparison with human doctors was also conducted. Results: The area under the receiver operating characteristic curves of the DLS on the internal test set and the two external test sets were 0.950 (95% CI 0.942 to 0.957) to 0.996 (95% CI 0.994 to 0.998), 0.931 (95% CI 0.923 to 0.939) to 1.000 (95% CI 0.999 to 1.000) and 0.934 (95% CI 0.929 to 0.938) to 1.000 (95% CI 0.999 to 1.000), with sensitivities of 80.4% (95% CI 79.1% to 81.6%) to 97.3% (95% CI 96.7% to 97.8%), 64.6% (95% CI 63.0% to 66.1%) to 100% (95% CI 100% to 100%) and 68.0% (95% CI 67.1% to 68.9%) to 100% (95% CI 100% to 100%), respectively, and specificities of 89.7% (95% CI 88.8% to 90.7%) to 98.1% (95% CI 97.7% to 98.6%), 78.7% (95% CI 77.4% to 80.0%) to 99.6% (95% CI 99.4% to 99.8%) and 88.1% (95% CI 87.4% to 88.7%) to 98.7% (95% CI 98.5% to 99.0%), respectively. When compared with human doctors, the DLS obtained a higher diagnostic sensitivity but lower specificity. Conclusion: The proposed DLS is effective in diagnosing normal fundus and 12 major fundus diseases, and thus has much potential for fundus diseases screening in the real world.
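Sensitivity and specificity, as reported per disease above, can be computed one disease at a time by treating that disease as the positive class and everything else as negative. A minimal sketch under that one-vs-rest assumption (the abstract does not state how multi-disease images were scored); the labels in the example are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: "dr" = referable diabetic retinopathy,
# "rvo" = retinal vein occlusion.
y_true = ["dr", "dr", "normal", "normal", "rvo"]
y_pred = ["dr", "normal", "normal", "dr", "rvo"]
sens, spec = sensitivity_specificity(y_true, y_pred, positive="dr")
```

A system tuned for screening often accepts lower specificity in exchange for higher sensitivity, which matches the comparison with human doctors described in the abstract.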


2021 ◽  
Author(s):  
Peng Zhang ◽  
Fan Lin ◽  
Fei Ma ◽  
Yuting Chen ◽  
Daowen Wang ◽  
...  

Summary
Background: With the increasing demand for atrial fibrillation (AF) screening, clinicians spend a significant amount of time in identifying the AF signals from massive electrocardiogram (ECG) data in long-term dynamic ECG monitoring. In this study, we aim to reduce clinicians’ workload and promote AF screening by using artificial intelligence (AI) to automatically detect AF episodes and identify AF patients in 24 h Holter recording.
Methods: We used a total of 22 979 Holter recordings (24 h) from 22 757 adult patients and established accurate annotations for AF by cardiologists. First, a randomized clinical cohort of 3 000 recordings (1 500 AF and 1 500 non-AF) from 3000 patients recorded between April 2012 and May 2020 was collected and randomly divided into training, validation and test sets (10:1:4). Then, a deep-learning-based AI model was developed to automatically detect AF episodes using RR intervals and was tested with the test set. Based on AF episode detection results, AF patients were automatically identified by using a criterion of at least one AF episode of 6 min or longer. Finally, the clinical effectiveness of the model was verified with an independent real-world test set including 19 979 recordings (1 006 AF and 18 973 non-AF) from 19 757 consecutive patients recorded between June 2020 and January 2021.
Findings: Our model achieved high performance for AF episode detection in both test sets (sensitivity: 0.992 and 0.972; specificity: 0.997 and 0.997, respectively). It also achieved high performance for AF patient identification in both test sets (sensitivity: 0.993 and 0.994; specificity: 0.990 and 0.973, respectively). Moreover, it obtained superior and consistent performance in an external public database.
Interpretation: Our AI model can automatically identify AF in long-term ECG recording with high accuracy. This cost-effective strategy may promote AF screening by improving diagnostic effectiveness and reducing clinical workload.
Research in context
Evidence before this study: We searched Google Scholar and PubMed for research articles on artificial intelligence-based diagnosis of atrial fibrillation (AF) published in English between Jan 1, 2016 and Aug 1, 2021, using the search terms “deep learning” OR “deep neural network” OR “machine learning” OR “artificial intelligence” AND “atrial fibrillation”. We found that most of the previous deep learning models in AF detection were trained and validated on benchmark datasets (such as the PhysioNet database, the Massachusetts Institute of Technology Beth Israel Hospital AF database or Long-Term AF database), in which there were fewer than 100 patients or the recordings contained only short ECG segments (30-60 s). Our search did not identify any articles that explored deep neural networks for AF detection in a large real-world dataset of 24 h Holter recordings, nor did we find articles describing methods that can automatically identify patients with AF in 24 h Holter recordings.
Added value of this study: First, long-term Holter monitoring is the main method of AF screening; however, most previous studies of automatic AF detection were tested mainly on short ECG recordings. This work focused on 24 h Holter recording data and achieved high accuracy in detecting AF episodes. Second, AF episode detection does not automatically translate into AF patient identification in 24 h Holter recording, since at present there is no well-recognized criterion for automatically identifying AF patients. Therefore, we established a criterion to identify AF patients by use of at least one AF episode of 6 min or longer, as this condition led to significantly increased risk of thromboembolism. Using this criterion, our method identified AF patients with high accuracy. Finally, and more importantly, our model was trained on a randomized clinical dataset and tested on an independent real-world clinical dataset to show great potential in clinical application. We did not exclude rare or special cases in the real-world dataset so as not to inflate our AF detection performance. To the best of our knowledge, this is the first study to automatically identify both AF episodes and AF patients in 24 h Holter recordings of a large real-world clinical dataset.
Implications of all the available evidence: Our deep learning model automatically identified AF patients with high accuracy in 24 h Holter recording and was verified in real-world data; therefore, it can be embedded into the Holter analysis system and deployed at the clinical level to assist the decision making of the Holter analysis system and clinicians. This approach can help improve the efficiency of AF screening and reduce the cost of AF diagnosis. In addition, our RR-interval-based model achieved comparable or better performance than the raw-ECG-based method, and can be widely applied to medical devices that can collect heartbeat information, including not only the multi-lead and single-lead Holter devices, but also other wearable devices that can reliably measure the heartbeat signals.
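The patient-level rule described above (flag a patient when at least one AF episode lasts 6 min or longer) can be sketched as a post-processing step on episode detections. The sketch assumes, hypothetically, that the detector emits one AF/non-AF flag per RR interval; the abstract does not describe the model's actual output format, and the function names are illustrative.

```python
def af_episode_durations(rr_intervals, is_af):
    """Durations in seconds of each run of consecutive AF-flagged beats.

    Assumption: one boolean flag per RR interval (interval in seconds).
    """
    durations = []
    current = 0.0
    for rr, flag in zip(rr_intervals, is_af):
        if flag:
            current += rr
        elif current > 0:
            durations.append(current)
            current = 0.0
    if current > 0:  # recording ended during an episode
        durations.append(current)
    return durations


def is_af_patient(rr_intervals, is_af, min_duration_s=360.0):
    """True if any AF episode lasts at least min_duration_s (6 min)."""
    durations = af_episode_durations(rr_intervals, is_af)
    return any(d >= min_duration_s for d in durations)


# Hypothetical recording: 1000 beats, each 0.5 s apart.
rr = [0.5] * 1000
one_long_episode = [True] * 720 + [False] * 280                  # 360 s of AF
two_short_episodes = [True] * 700 + [False] * 10 + [True] * 290  # 350 s and 145 s
```

Here `is_af_patient(rr, one_long_episode)` flags the patient, while the second recording is not flagged because neither episode reaches 6 min on its own, even though the total AF time is longer.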

