An Accurate and Explainable Deep Learning System Improves Interobserver Agreement in the Interpretation of Chest Radiograph

Author(s):  
Hieu Huy Pham ◽  
Ha Q. Nguyen ◽  
Khanh Lam ◽  
Linh T. Le ◽  
Dung B. Nguyen ◽  
...  

Interpretation of chest radiographs (CXR) is a difficult but essential task for detecting thoracic abnormalities. Recent artificial intelligence (AI) algorithms have achieved radiologist-level performance on various medical classification tasks. However, only a few studies addressed the localization of abnormal findings from CXR scans, which is essential in explaining the image-level classification to radiologists. Additionally, the actual impact of AI algorithms on the diagnostic performance of radiologists in clinical practice remains relatively unclear. To bridge these gaps, we developed an explainable deep learning system called VinDr-CXR that can classify a CXR scan into multiple thoracic diseases and, at the same time, localize most types of critical findings on the image. VinDr-CXR was trained on 51,485 CXR scans with radiologist-provided bounding box annotations. It demonstrated a comparable performance to experienced radiologists in classifying 6 common thoracic diseases on a retrospective validation set of 3,000 CXR scans, with a mean area under the receiver operating characteristic curve (AUROC) of 0.967 (95% confidence interval [CI]: 0.958-0.975). The sensitivity, specificity, F1-score, false-positive rate (FPR), and false-negative rate (FNR) of the system at the optimal cutoff value were 0.933 (0.898-0.964), 0.900 (0.887-0.911), 0.631 (0.589-0.672), 0.101 (0.089-0.114) and 0.067 (0.057-0.102), respectively. For the localization task with 14 types of lesions, our free-response receiver operating characteristic (FROC) analysis showed that the VinDr-CXR achieved a sensitivity of 80.2% at the rate of 1.0 false-positive lesion identified per scan. A prospective study was also conducted to measure the clinical impact of the VinDr-CXR in assisting six experienced radiologists. 
The results indicated that the proposed system, when used as a diagnosis-support tool, significantly improved the agreement among the radiologists, with an increase of 1.5% in mean Fleiss' Kappa. We also observed that, after the radiologists consulted VinDr-CXR's suggestions, the agreement between each of them and the system increased markedly, by 3.3% in mean Cohen's Kappa. Altogether, our results highlight the potential of the proposed deep learning system as an effective assistant to radiologists in clinical practice. Part of the dataset used for developing the VinDr-CXR system has been made publicly available at https://physionet.org/content/vindr-cxr/1.0.0/.
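
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Fleiss' kappa for several raters. It is not taken from the VinDr-CXR study; the function name `fleiss_kappa` and the input layout (one row per scan, one column per category, each cell counting the raters who chose that category) are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    ratings[i][j] is the number of raters who assigned subject i to
    category j. Every row must sum to the same number of raters (>= 2).
    Raises ZeroDivisionError if all ratings fall in a single category.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each subject: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical data: 4 scans, 6 raters, categories (abnormal, normal).
    table = [[6, 0], [0, 6], [5, 1], [2, 4]]
    print(round(fleiss_kappa(table), 3))
```

A value of 1 means the raters always agree, and values near 0 mean agreement is no better than chance.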

1981 ◽  
Vol 27 (9) ◽  
pp. 1569-1574 ◽  
Author(s):  
E A Robertson ◽  
M H Zweig

Abstract The usefulness of an analytical system in patient care is ultimately judged not by its analytical performance but by its clinical performance, i.e., its ability to separate apparently similar patients into two subgroups, one of which has a particular clinically important condition and another subgroup which does not. This clinical performance can be studied with the tools of signal detectability theory, originally developed to analyze the performance of radar and data-transmission systems. Each classification made by an analytical system may be categorized as a true-positive, true-negative, false-positive, or false-negative decision. For laboratory tests the proportion of decisions in each category depends on the biological overlap between the two subgroups, the analytical performance of the system, and the decision level chosen. The clinical performance of the analytical system for all possible decision levels is represented by the receiver operating characteristic curve, which plots the true-positive rate against the false-positive rate. The use of these curves permits comparison of alternative analytical techniques at equal true-positive rates and at all possible decision levels. These comparisons show the effect of analytical improvements on clinical performance.
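
The construction described in this abstract can be sketched in a few lines: sweep the decision level over every observed test value and record the false-positive and true-positive rates at each one. This is a minimal illustration, assuming that higher values indicate the condition; the name `roc_points` and the sample data are hypothetical, not taken from the paper.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per decision level.

    scores: test values, where higher means "more likely positive".
    labels: 1 for patients with the condition, 0 for those without.
    A patient is called positive when score >= decision level.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # strictest level: nobody is called positive
    for level in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= level and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= level and y == 0)
        points.append((fp / neg, tp / pos))
    return points


if __name__ == "__main__":
    # Hypothetical test results for six patients.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    for fpr, tpr in roc_points(scores, labels):
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```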


Author(s):  
Monique Stewart ◽  
Hamed Pouryousef ◽  
Brian Marquis ◽  
Som P. Singh ◽  
Demet Cakdi

The Federal Railroad Administration (FRA) has partnered with Metro-North Railroad (MNR), Long Island Rail Road (LIRR) and New York & Atlantic Railway (NYA) to enhance operational safety through the implementation of wayside detection systems. Currently, MNR has a four-track Wheel Impact Load Detector (WILD) system that has been operating since 2010 near the Grand Central Terminal. This paper discusses a Receiver Operating Characteristic (ROC) analysis of this existing WILD system in conjunction with the wheel maintenance practices since 2010. Currently MNR’s operating procedures require a car with wheel(s) exhibiting a vertical peak load/mean load ratio, called dynamic ratio (DR), ≥3.0 to be shopped for repair. The analysis, using a 30-day repair window after detection, shows that 84% of the cars shopped for wheel(s) with DR≥3.0 required valid maintenance repairs. The minimum number of total false records (false positive + false negative records, combined) were observed within the DR range of 2.7–2.8 when considering wheel flat defects only. An analysis of the false negative records inclusive of both flat and shell spots, showed that the minimum number of false records dropped slightly to a DR range of 2.6–2.7. The reported ROC analysis shows that MNR’s current DR≥3.0 to trigger inspection and maintenance actions is reasonable.
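
The threshold search mentioned above (finding the DR range with the fewest false positive plus false negative records) might look roughly like the sketch below. The record layout, the function names, and all data values are hypothetical; the paper's actual matching of detections to repairs within the 30-day window is not reproduced here.

```python
def total_false_records(records, threshold):
    """Count false positives plus false negatives at a given DR threshold.

    records: iterable of (dynamic_ratio, needed_repair) pairs, where
    needed_repair is True if a valid wheel repair followed the detection.
    """
    fp = sum(1 for dr, repaired in records if dr >= threshold and not repaired)
    fn = sum(1 for dr, repaired in records if dr < threshold and repaired)
    return fp + fn


def best_threshold(records, candidates):
    """Return the candidate threshold with the fewest total false records."""
    return min(candidates, key=lambda t: total_false_records(records, t))


if __name__ == "__main__":
    # Hypothetical detector readings, for illustration only.
    records = [
        (2.2, False), (2.5, False), (2.6, False),
        (2.7, True), (2.8, True), (2.9, False),
        (3.0, True), (3.2, True), (3.5, True),
    ]
    candidates = [2.5, 2.6, 2.7, 2.8, 2.9, 3.0]
    print(best_threshold(records, candidates))
```

Minimizing the plain sum treats a missed defect and an unnecessary shopping as equally costly; a weighted sum would be a natural variation if the two errors have different consequences.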


1991 ◽  
Vol 124 (3) ◽  
pp. 295-306 ◽  
Author(s):  
A. D. Genazzani ◽  
D. Rodbard

Abstract. We utilize the "Receiver Operating Characteristic" to describe the relationship between sensitivity and specificity as the threshold for peak detection is varied systematically, to provide objective comparison of the performance of methods for detection of episodic hormonal secretion. A computer program was used to generate synthetic data with peaks with variable durations, with constant or variable height, shape and/or interpulse interval. This approach was used to compare the CLUSTER and DETECT programs. For both programs, the observed false positive rates estimated using signal-free data were in good agreement with the nominal rates, but in the presence of signal the observed false positive rates were systematically lower. Sensitivity increases with increasing signal/noise ratio, as expected. Program DETECT, using its standard options, provided excellent sensitivity (90-100%) with very low false positive rate under all conditions tested. Its performance could be further improved by the use of a more stringent definition of a peak requiring the presence of "UP" followed by a "DOWN". The CLUSTER program was found to have very poor sensitivity when using the "local variance" option. Use of the true fixed standard deviation or percent coefficient of variation resulted in a modest improvement. Optimal performance of program CLUSTER was obtained by the use of the best of 3 variance models, testing 12 different cluster sizes (from 1×1 to 4×4) and selecting the best among these: under these conditions it can achieve high sensitivity (90-100%) for very low observed false positive rate, such that its performance was comparable to that of DETECT. 
The methods developed and illustrated here should permit the definitive characterization and validation of the performance of any one method, the objective comparison of the relative performance of two or more methods for analysis of pulsatile hormone levels for episodic hormone secretion, and lead to the improvement of algorithms for peak detection.


2011 ◽  
Vol 42 (5) ◽  
pp. 895-898 ◽  
Author(s):  
G. Szmukler ◽  
B. Everitt ◽  
M. Leese

Risk assessment is now regarded as a necessary competence in psychiatry. The area under the curve (AUC) statistic of the receiver operating characteristic curve is increasingly offered as the main evidence for accuracy of risk assessment instruments. But, even a highly statistically significant AUC is of limited value in clinical practice.


2012 ◽  
Vol 24 (10) ◽  
pp. 2789-2824 ◽  
Author(s):  
Takashi Takenouchi ◽  
Osamu Komori ◽  
Shinto Eguchi

While most proposed methods for solving classification problems focus on minimization of the classification error rate, we are interested in the receiver operating characteristic (ROC) curve, which provides more information about classification performance than the error rate does. The area under the ROC curve (AUC) is a natural measure for overall assessment of a classifier based on the ROC curve. We discuss a class of concave functions for AUC maximization in which a boosting-type algorithm including RankBoost is considered, and the Bayesian risk consistency and the lower bound of the optimum function are discussed. A procedure derived by maximizing a specific optimum function has high robustness, based on gross error sensitivity. Additionally, we focus on the partial AUC, which is the partial area under the ROC curve. For example, in medical screening, a high true-positive rate to the fixed lower false-positive rate is preferable and thus the partial AUC corresponding to lower false-positive rates is much more important than the remaining AUC. We extend the class of concave optimum functions for partial AUC optimality with the boosting algorithm. We investigated the validity of the proposed method through several experiments with data sets in the UCI repository.
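
To make the quantity being optimized concrete, here is a minimal sketch of the partial AUC computed with the trapezoidal rule over ROC points, restricted to false-positive rates at or below a cap. It only illustrates the evaluation measure, not the boosting procedure proposed in the paper; the function name `partial_auc` and the example curve are assumptions.

```python
def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under a ROC curve for FPR in [0, max_fpr].

    points: (FPR, TPR) pairs sorted by non-decreasing FPR, starting at
    (0, 0) and ending at (1, 1). With max_fpr=1.0 this is the full AUC.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at the cap by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Hypothetical ROC curve.
    curve = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
    print(partial_auc(curve))               # full AUC
    print(partial_auc(curve, max_fpr=0.25)) # area for FPR <= 0.25 only
```

The partial area is often divided by `max_fpr` so that a perfect classifier scores 1 regardless of the cap; that normalization is omitted here.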


PEDIATRICS ◽  
1991 ◽  
Vol 87 (5) ◽  
pp. 670-674 ◽  
Author(s):  
David M. Jaffe ◽  
Gary R. Fleisher

This study was designed to quantify more precisely the accuracy of magnitude of rectal temperature and total white blood cell (WBC) count as indicators of bacteremia in children with an obvious focal bacterial infection. A total of 955 children, aged 3 to 36 months, who had rectal temperature ≥39.0°C and were seeking care at either of two urban pediatric emergency departments had blood drawn for culture; 885 had blood drawn for WBC count. Twenty-seven had bacteremia. Various combinations of temperature and WBC count were selected to construct receiver-operating-characteristic curves by plotting sensitivity vs false-positive rate (1 - specificity). The receiver-operating-characteristic curve of WBC count provided significantly better diagnostic information than the curve for temperature increments above 39.0°C. Each increment of 0.5°C led to large decrements in sensitivity and false-positive rates. At a WBC count cutoff of 10 000/mm3, the sensitivity was 92% while the false-positive rate was 57%. Using this cutoff point, the clinician could have avoided performing 368 of 955 blood cultures and missed only 2 of 26 children with bacteremia. Receiver-operating-characteristic curves combining WBC count and temperature increments above 39.0°C provided no better diagnostic information than that of WBC count at a temperature cutoff of 39.0°C. It is concluded that increments in temperature above 39.0°C provided additional diagnostic specificity for bacteremia only at the expense of unacceptable decreases in sensitivity. Total WBC count provided better information. A WBC count cutoff of 10 000/mm3 increased specificity with minimal decrease in sensitivity. Receiver-operating-characteristic curve analysis allows selection of cutoff criteria by individual practitioners based on the prevalence of bacteremia in their communities and on the perceived risks of bacteremia.
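
The rates quoted at a given cutoff follow directly from the four cells of a 2×2 table. The sketch below, with the hypothetical helper `diagnostic_rates`, shows the arithmetic. The true-positive and false-negative counts (24 and 2) come from the "missed only 2 of 26" figure above; the false-positive and true-negative counts are placeholders chosen only to give a 57% false-positive rate, since the abstract does not report them.

```python
def diagnostic_rates(tp, fp, tn, fn):
    """Sensitivity, specificity, FPR and FNR from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "fpr": fp / (fp + tn),          # equals 1 - specificity
        "fnr": fn / (fn + tp),          # equals 1 - sensitivity
    }


if __name__ == "__main__":
    # tp and fn follow from "missed only 2 of 26"; fp and tn are
    # placeholder counts, not figures from the study.
    rates = diagnostic_rates(tp=24, fp=57, tn=43, fn=2)
    for name, value in rates.items():
        print(f"{name}: {value:.3f}")
```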


Breast cancer is one of the most common cancers among women worldwide; available data indicate that one in eight women is affected by the disease during her lifetime. Mammography is the best imaging modality for detecting the disease at an early stage. Because of the poor contrast and low visibility of mammographic images, early detection of breast cancer remains a major challenge to effective treatment. Various computer-aided detection (CAD) algorithms have been developed to help radiologists give an accurate diagnosis. This paper surveys the most widely used image segmentation methods developed for detecting calcifications and masses. The main focus of the survey is on image segmentation strategies and the features used for early breast cancer detection. Texture analysis, which relies on local spatial variation of color or shading, is a vital step in any image segmentation technique. Accordingly, different texture analysis techniques for microcalcification and mass detection are discussed in the context of mammography. The aim of this paper is to review existing approaches to the segmentation of masses and automated detection in mammographic images, underlining the key points and main differences among the methods used. The key goal is to point out the advantages and disadvantages of the different approaches. In contrast with other surveys, which only describe and compare approaches qualitatively, this review also provides a quantitative comparison. The proposed research uses a deep learning based network to classify mammography images, whereas previous approaches used classical machine learning. 
The main drawback of machine learning is that features are selected manually or by hand-designed functions, whereas deep learning detects features automatically and adapts them to the image. The performance of seven mass detection techniques is compared using two different mammography databases: a public digitized database and a local full-field digital database. The results are given in terms of free-response receiver operating characteristic (FROC) and receiver operating characteristic (ROC) analysis.


Author(s):  
Kathrin Dolle ◽  
Gerd Schulte-Körne ◽  
Nikolaus von Hofacker ◽  
Yonca Izat ◽  
Antje-Kathrin Allgaier

Objective: This study examines the agreement between structured child and parent interviews and clinical judgment in diagnosing depressive episodes in childhood and adolescence. It also tests whether the accuracy and the optimal cutoff values of self-report questionnaires differ depending on which of these rater perspectives serves as the reference. Method: Structured Kinder-DIPS interviews were conducted with 81 children (9–12 years) and 88 adolescents (13–16 years) who presented at child and adolescent psychiatric clinics or practices, and with their parents. The children completed the Depressions-Inventar für Kinder und Jugendliche (DIKJ), and the adolescents completed the short form of the Allgemeine Depressions-Skala (ADS-K). Agreement was assessed using kappa coefficients. Optimal cutoff values, sensitivity, specificity, and positive and negative predictive values were determined from receiver operating characteristic (ROC) curves. Results: Agreement among the interviews, and between the interviews and clinical judgment, was low to moderate. Depressive episodes were diagnosed more often by clinical judgment than in the interviews. Cutoff values and validity measures of the self-report questionnaires varied by reference standard, with the poorest results for clinical judgment. Conclusions: Clinical raters could benefit from using structured interviews. Strategies for handling discrepant child and parent reports should be tested empirically and described in detail.
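
As a rough illustration of the kappa coefficient used here to compare two sources of diagnosis, the sketch below computes Cohen's kappa for a pair of raters. The function name `cohen_kappa` and the example ratings are hypothetical; the study's actual data and software are not known from the abstract.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who label the same cases.

    rater_a, rater_b: equal-length sequences of category labels.
    Raises ZeroDivisionError if both raters use one identical category
    for every case, because chance agreement is then 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical diagnoses (1 = depressive episode, 0 = none).
    child_interview = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    parent_interview = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
    print(round(cohen_kappa(child_interview, parent_interview), 3))
```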

