Relationships of Cohen's Kappa, Sensitivity, and Specificity for Unbiased Annotations

Author(s):  
Juan Wang ◽  
Bin Xia


2020 ◽  
pp. oemed-2020-106658
Author(s):  
Mahée Gilbert-Ouimet ◽  
Xavier Trudel ◽  
Karine Aubé ◽  
Ruth Ndjaboue ◽  
Caroline S Duchaine ◽  
...  

Objectives: This study assesses the validity of a self-reported mental health problem (MHP) diagnosis as the reason for a work absence of 5 days or more compared with a physician-certified MHP diagnosis related to the same work absence. The potential modifying effect of absence duration on validity is also examined. Methods: A total of 709 participants (1031 sickness absence episodes) were selected and interviewed. Total per cent agreement, Cohen’s kappa, sensitivity and specificity values were calculated using the physician-certified MHP diagnosis related to a given work absence as the reference standard. Stratified analyses of total agreement, sensitivity and specificity values were also examined by duration of work absence (5–20 workdays, >20 workdays). Results: Total agreement value for self-reported MHP was 90%. Cohen’s kappa value was substantial (0.74). Sensitivity was 77% and specificity was 95%. Absences of more than 20 workdays had a better sensitivity than absences of shorter duration. A high specificity was observed for both short and longer absence episodes. Conclusion: This study showed high specificity and good sensitivity of self-reported MHP diagnosis compared with physician-certified MHP diagnosis for the same work absence. Absences of longer durations had a better sensitivity.
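The figures reported in this abstract can be checked against one another, because Cohen's kappa for a binary test against a reference standard is determined by sensitivity, specificity, and prevalence. The following is a minimal sketch, not the authors' method: the function names are hypothetical, and the prevalence is not reported in the abstract, so it is back-calculated here from the reported total agreement under the assumption that the rounded percentages are exact.

```python
def prevalence_from_agreement(agreement, sensitivity, specificity):
    """Solve agreement = p*Se + (1 - p)*Sp for the prevalence p.

    Assumption: the inputs are exact; undefined when Se == Sp.
    """
    return (agreement - specificity) / (sensitivity - specificity)


def kappa_from_rates(sensitivity, specificity, prevalence):
    """Cohen's kappa between a binary test and its reference standard."""
    # Fraction of all cases that the test labels positive.
    test_positive = (prevalence * sensitivity
                     + (1 - prevalence) * (1 - specificity))
    # Observed agreement: true positives plus true negatives.
    observed = prevalence * sensitivity + (1 - prevalence) * specificity
    # Agreement expected by chance from the two marginal positive rates.
    expected = (prevalence * test_positive
                + (1 - prevalence) * (1 - test_positive))
    return (observed - expected) / (1 - expected)


# Values taken from the abstract: agreement 90%, Se 77%, Sp 95%.
p = prevalence_from_agreement(0.90, 0.77, 0.95)
kappa = kappa_from_rates(0.77, 0.95, p)
print(round(kappa, 2))  # prints 0.74
```

Under these assumptions the implied prevalence is about 28%, and the resulting kappa agrees with the reported 0.74 to two decimals, so the four summary statistics are mutually consistent.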


Children ◽  
2021 ◽  
Vol 8 (8) ◽  
pp. 659
Author(s):  
Ioana Mihaiela Ciuca ◽  
Mihaela Dediu ◽  
Monica Steluta Marc ◽  
Mirabela Lukic ◽  
Delia Ioana Horhat ◽  
...  

Background: Pneumonia is the leading cause of death among children; thus, a correct early diagnosis would be ideal. Imaging diagnosis still relies on chest X-ray (CXR), but lung ultrasound (LUS) proves to be reliable for pneumonia diagnosis. The aim of our study was to evaluate the sensitivity and specificity of LUS compared to CXR in consolidated pneumonia. Methods: Children with clinical suspicion of bacterial pneumonia were screened by LUS for pneumonia, followed by CXR. The agreement between LUS and CXR regarding the detection of consolidation was evaluated by Cohen’s kappa test. Results: A total of 128 patients with clinical suspicion of pneumonia were evaluated; 74 of them were confirmed by imaging and biological inflammatory markers. The highest frequency of pneumonia was in the 0–3 years age group (37.83%). Statistical estimation of the agreement between LUS and CXR in detection of the consolidation found an almost perfect agreement, with a Cohen’s kappa coefficient of K = 0.89 ± 0.04 SD, p < 0.001. Sensitivity of LUS was superior to CXR in detection of consolidations. Conclusion: Lung ultrasound is a reliable method for the detection of pneumonia consolidation in hospitalized children, with sensitivity and specificity superior to CXR. LUS should be used for rapid and safe evaluation of child pneumonia.


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0244023
Author(s):  
Darunee Chotiprasitsakul ◽  
Pataraporn Pewloungsawat ◽  
Chavachol Setthaudom ◽  
Pitak Santanirand ◽  
Prapaporn Pornsuriyasak

Background PCR is more sensitive than immunofluorescence assay (IFA) for detection of Pneumocystis jirovecii. However, PCR cannot always distinguish infection from colonization. This study aimed to compare the performance of real-time PCR and IFA for diagnosis of P. jirovecii pneumonia (PJP) in a real-world clinical setting. Methods A retrospective cohort study was conducted at a 1,300-bed hospital between April 2017 and December 2018. Patients whose respiratory samples (bronchoalveolar lavage or sputum) were tested by both Pneumocystis PCR and IFA were included. Diagnosis of PJP was classified based on multicomponent criteria. Sensitivity, specificity, 95% confidence intervals (CI), and Cohen's kappa coefficient were calculated. Results There were 222 eligible patients. The sensitivity and specificity of PCR were 91.9% (95% CI, 84.0%–96.7%) and 89.7% (95% CI, 83.3%–94.3%), respectively. The sensitivity and specificity of IFA were 7.0% (95% CI, 2.6%–14.6%) and 99.2% (95% CI, 95.6%–100.0%), respectively. The percent agreement between PCR and IFA was 56.7% (Cohen's kappa -0.02). Among discordant PCR-positive and IFA-negative samples, 78% were collected after PJP treatment. Clinical management would have changed in 14% of patients using diagnostic information, mainly based on PCR results. Conclusions PCR is highly sensitive compared with IFA for detection of PJP. Combining clinical and radiological features with PCR is useful for diagnosis of PJP, particularly when respiratory specimens cannot be promptly collected before initiation of PJP treatment.
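Sensitivity and specificity are proportions, so a 95% confidence interval like those quoted above can be computed from the underlying counts. The sketch below uses the Wilson score interval, which is an assumption: the abstract does not say which interval method was used, and the counts in the usage example are hypothetical rather than taken from the study, so the output is not expected to reproduce the published intervals.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = (z * math.sqrt(p_hat * (1 - p_hat) / total
                                + z * z / (4 * total * total)) / denom)
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 45 true positives among 50 reference-positive cases.
low, high = wilson_interval(45, 50)
print(f"sensitivity 90.0%, 95% CI {low:.1%} to {high:.1%}")
```

The Wilson interval is chosen here only because it needs no external library and behaves sensibly for proportions near 0 or 1; an exact (Clopper-Pearson) interval is a common alternative.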


2021 ◽  
Vol 9 (A) ◽  
pp. 802-810
Author(s):  
Ghada Ismail ◽  
Rania Abdel Halim ◽  
Marwa Salah Mostafa ◽  
Dalia H Abdelhamid ◽  
Hossam Abdelghaffar ◽  
...  

Background To date, molecular assays are the gold-standard method for COVID-19 diagnosis. However, they are expensive and complex. There is a pressing need to develop other effective diagnostics for SARS‐CoV‐2 patients. Therefore, serological detection of antibodies against SARS‐CoV‐2 might provide a good alternative. Aim We aimed to compare and evaluate seven rapid diagnostic tests with Mindray chemiluminescent automated immunoassay as a reference method for SARS-CoV-2 antibody detection. Methods: This study included the serum of a total of 49 attendees to the Reference Laboratory of Egyptian university hospitals during the period from April 2021 to May 2021. Anti-COVID-19 antibody detection in serum samples was performed by the Mindray fully automated system as our reference method and seven rapid antibody tests: Wondfo, Vazyme, Dynamiker, Panbio, Artron, Maccura and Roche. Results: The chemiluminescent assay revealed 30 (61.2%) positive samples and 19 (38.8%) negative samples for COVID-19 IgG. For COVID-19 IgM, 11 (22.4%) samples were positive and 38 (77.6%) samples were negative. Anti-SARS-CoV-2 antibodies were not detected in any of the PCR negative individuals. The best diagnostic performance was demonstrated by Roche IgG and IgM, and Vazyme IgG and IgM antibody tests followed by Panbio. For Roche, the sensitivity and specificity for IgG and IgM were (83.3%, 89.5%) and (72.7%, 81.6%) respectively. For Vazyme, the sensitivity and specificity for IgG and IgM were (77.8%, 85.7%) and (75%, 91.7%) respectively. For Panbio, the sensitivity and specificity for IgG and IgM were (63.6%, 87.5%) and (50%, 86.7%) respectively. Cohen’s Kappa values revealed a substantial agreement for Roche IgG, Vazyme IgG and IgM of (0.7076, 0.6250, 0.6667) respectively. The worst agreement was reported for Maccura IgG, Wondfo, and Dynamiker IgM with Cohen’s Kappa values of (0.2508, 0.1893, 0.0313) respectively. 
Conclusions: Rapid tests in our study exhibited heterogeneous diagnostic performances. Roche, Vazyme, and Panbio antibody tests showed promising results in concordance with our reference method with the best-reported results. On the other hand, the other tests were inferior and failed in providing valid and reliable results. Further studies are necessary to determine the practicality of these tests in different settings and communities.


Author(s):  
Miriam Athmann ◽  
Roya Bornhütter ◽  
Nicolaas Busscher ◽  
Paul Doesburg ◽  
Uwe Geier ◽  
...  

Abstract: In the image forming methods, copper chloride crystallization (CCCryst), capillary dynamolysis (CapDyn), and circular chromatography (CChrom), characteristic patterns emerge in response to different food extracts. These patterns reflect the resistance to decomposition as an aspect of resilience and are therefore used in product quality assessment complementary to chemical analyses. In the presented study, rocket lettuce from a field trial with different radiation intensities, nitrogen supply, biodynamic, organic and mineral fertilization, and with or without horn silica application was investigated with all three image forming methods. The main objective was to compare two different evaluation approaches, differing in the type of image forming method leading the evaluation, the number of factors analyzed, and the deployed perceptual strategy: Firstly, image evaluation of samples from all four experimental factors simultaneously by two individual evaluators was based mainly on analyzing structural features in CapDyn (analytical perception). Secondly, a panel of eight evaluators applied a Gestalt evaluation imbued with a kinesthetic engagement of CCCryst patterns from either fertilization treatments or horn silica treatments, followed by a confirmatory analysis of individual structural features. With the analytical approach, samples from different radiation intensities and N supply levels were identified correctly in two out of two sample sets with groups of five samples per treatment each (Cohen’s kappa, p = 0.0079), and the two organic fertilizer treatments were differentiated from the mineral fertilizer treatment in eight out of eight sample sets with groups of three manure and two minerally fertilized samples each (Cohen’s kappa, p = 0.0048). 
With the panel approach based on Gestalt evaluation, biodynamic fertilization was differentiated from organic and mineral fertilization in two out of two exams with 16 comparisons each (Friedman test, p < 0.001), and samples with horn silica application were successfully identified in two out of two exams with 32 comparisons each (Friedman test, p < 0.001). Further research will show which properties of the food decisive for resistance to decomposition are reflected by analytical and Gestalt criteria, respectively, in CCCryst and CapDyn images.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Alexandre Maciel-Guerra ◽  
Necati Esener ◽  
Katharina Giebel ◽  
Daniel Lea ◽  
Martin J. Green ◽  
...  

Abstract: Streptococcus uberis is one of the leading pathogens causing mastitis worldwide. Identification of S. uberis strains that fail to respond to treatment with antibiotics is essential for better decision making and treatment selection. We demonstrate that the combination of supervised machine learning and matrix-assisted laser desorption ionization/time of flight (MALDI-TOF) mass spectrometry can discriminate strains of S. uberis causing clinical mastitis that are likely to be responsive or unresponsive to treatment. Diagnostic prediction systems trained on 90 individuals from 26 different farms achieved up to 86.2% accuracy and a Cohen’s kappa of 71.5%. The performance was further increased by adding metadata (parity, somatic cell count of previous lactation and count of positive mastitis cases) to encoded MALDI-TOF spectra, which increased accuracy and Cohen’s kappa to 92.2% and 84.1% respectively. A computational framework integrating protein–protein networks and structural protein information to the machine learning results unveiled the molecular determinants underlying the responsive and unresponsive phenotypes.


Author(s):  
Maximilian Lutz ◽  
Martin Möckel ◽  
Tobias Lindner ◽  
Christoph J. Ploner ◽  
Mischa Braun ◽  
...  

Abstract Background Management of patients with coma of unknown etiology (CUE) is a major challenge in most emergency departments (EDs). CUE is associated with a high mortality and a wide variety of pathologies that require differential therapies. A suspected diagnosis issued by pre-hospital emergency care providers often drives the first approach to these patients. We aim to determine the accuracy and value of the initial diagnostic hypothesis in patients with CUE. Methods Consecutive ED patients presenting with CUE were prospectively enrolled. We obtained the suspected diagnoses or working hypotheses from standardized reports given by prehospital emergency care providers, both paramedics and emergency physicians. Suspected and final diagnoses were classified into I) acute primary brain lesions, II) primary brain pathologies without acute lesions and III) pathologies that affected the brain secondarily. We compared suspected and final diagnosis with percent agreement and Cohen’s Kappa including sub-group analyses for paramedics and physicians. Furthermore, we tested the value of suspected and final diagnoses as predictors for mortality with binary logistic regression models. Results Overall, suspected and final diagnoses matched in 62% of 835 enrolled patients. Cohen’s Kappa showed a value of κ = .415 (95% CI .361–.469, p < .005). There was no relevant difference in diagnostic accuracy between paramedics and physicians. Suspected diagnoses did not significantly interact with in-hospital mortality (e.g., suspected class I: OR .982, 95% CI .518–1.836) while final diagnoses interacted strongly (e.g., final class I: OR 5.425, 95% CI 3.409–8.633). Conclusion In cases of CUE, the suspected diagnosis is unreliable, regardless of different pre-hospital care providers’ qualifications. It is not an appropriate decision-making tool as it neither sufficiently predicts the final diagnosis nor detects the especially critical comatose patient. 
To avoid the risk of mistriage and unnecessarily delayed therapy, we advocate for a standardized diagnostic work-up for all CUE patients that should be triggered by the emergency symptom alone and not by any suspected diagnosis.
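The comparison described above involves three diagnostic classes, and Cohen's kappa extends directly from the binary case to any number of categories. The following is a minimal sketch using only the standard library; the function name and the example labels are hypothetical illustrations and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal rates, per category.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical suspected vs. final diagnosis classes for six patients.
suspected = ["I", "I", "II", "II", "III", "III"]
final = ["I", "I", "II", "III", "III", "III"]
print(cohens_kappa(suspected, final))  # 5/6 observed, 1/3 by chance
```

In this toy example five of six labels match, but a third of that agreement is expected by chance, so kappa (0.75) is lower than the raw percent agreement, which is the same pattern as the 62% agreement and κ = .415 reported above.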


2021 ◽  
Vol 11 (6) ◽  
pp. 2723
Author(s):  
Fatih Uysal ◽  
Fırat Hardalaç ◽  
Ozan Peker ◽  
Tolga Tolunay ◽  
Nil Tokgöz

Fractures occur in the shoulder area, which has a wider range of motion than other joints in the body, for various reasons. To diagnose these fractures, data gathered from X-radiation (X-ray), magnetic resonance imaging (MRI), or computed tomography (CT) are used. This study aims to help physicians by classifying shoulder images taken from X-ray devices as fracture/non-fracture with artificial intelligence. For this purpose, the performances of 26 deep learning-based pre-trained models in the detection of shoulder fractures were evaluated on the musculoskeletal radiographs (MURA) dataset, and two ensemble learning models (EL1 and EL2) were developed. The pre-trained models used are ResNet, ResNeXt, DenseNet, VGG, Inception, MobileNet, and their spinal fully connected (Spinal FC) versions. In the EL1 and EL2 models, developed using the best-performing pre-trained models, test accuracy was 0.8455 and 0.8472, Cohen’s kappa was 0.6907 and 0.6942, and the area under the receiver operating characteristic (ROC) curve (AUC) for the fracture class was 0.8862 and 0.8695, respectively. Across 28 different classifications in total, the highest test accuracy and Cohen’s kappa values were obtained with the EL2 model, and the highest AUC value was obtained with the EL1 model.


Author(s):  
Calli Ostrofsky ◽  
Jaishika Seedat

Background: Notwithstanding its value, there are challenges and limitations to implementing a dysphagia screening tool from a developed context in a developing context. The need for a reliable and valid screening tool for dysphagia that considers context, systemic rules and resources was identified to prevent further medical compromise, optimise dysphagia prognosis and ultimately hasten patients’ return to home or work. Methodology: To establish the validity and reliability of the South African dysphagia screening tool (SADS) for acute stroke patients accessing government hospital services. The study was a quantitative, non-experimental, correlational cross-sectional design with a retrospective component. Convenience sampling was used to recruit 18 speech-language therapists and 63 acute stroke patients from three South African government hospitals. The SADS consists of 20 test items and was administered by speech-language therapists. Screening was followed by a diagnostic dysphagia assessment. The administrator of the tool was not involved in completing the diagnostic assessment, to eliminate bias and prevent contamination of results from screener to diagnostic assessment. Sensitivity, validity and efficacy of the screening tool were evaluated against the results of the diagnostic dysphagia assessment. Cohen’s kappa measures determined inter-rater agreement between the results of the SADS and the diagnostic assessment. Results and conclusion: The SADS was proven to be valid and reliable. Cohen’s kappa indicated a high inter-rater reliability and showed high sensitivity and adequate specificity in detecting dysphagia amongst acute stroke patients who were at risk for dysphagia. The SADS was characterised by concurrent, content and face validity. As a first step in establishing contextual appropriateness, the SADS is a valid and reliable screening tool that is sensitive in identifying stroke patients at risk for dysphagia within government hospitals in South Africa.


Stroke ◽  
2021 ◽  
Author(s):  
Maximilian Nielsen ◽  
Moritz Waldmann ◽  
Andreas M. Frölich ◽  
Fabian Flottmann ◽  
Evelin Hristova ◽  
...  

Background and Purpose: Mechanical thrombectomy is an established procedure for treatment of acute ischemic stroke. Mechanical thrombectomy success is commonly assessed by the Thrombolysis in Cerebral Infarction (TICI) score, assigned by visual inspection of X-ray digital subtraction angiography data. However, expert-based TICI scoring is highly observer-dependent. This represents a major obstacle for mechanical thrombectomy outcome comparison in, for instance, multicentric clinical studies. Focusing on occlusions of the M1 segment of the middle cerebral artery, the present study aimed to develop a deep learning (DL) solution to automated and, therefore, objective TICI scoring, to evaluate the agreement of DL- and expert-based scoring, and to compare corresponding numbers to published scoring variability of clinical experts. Methods: The study comprises 2 independent datasets. For DL system training and initial evaluation, an in-house dataset of 491 digital subtraction angiography series and modified TICI scores of 236 patients with M1 occlusions was collected. To test the model generalization capability, an independent external dataset with 95 digital subtraction angiography series was analyzed. Characteristics of the DL system were modeling TICI scoring as ordinal regression, explicit consideration of the temporal image information, integration of physiological knowledge, and modeling of inherent TICI scoring uncertainties. Results: For the in-house dataset, the DL system yields Cohen’s kappa, overall accuracy, and specific agreement values of 0.61, 71%, and 63% to 84%, respectively, compared with the gold standard: the expert rating. Values slightly drop to 0.52/64%/43% to 87% when the model is, without changes, applied to the external dataset. After model updating, they increase to 0.65/74%/60% to 90%. Cohen’s kappa values reported in the literature for expert-based TICI scoring agreement are on the order of 0.6. 
Conclusions: The agreement of DL- and expert-based modified TICI scores in the range of published interobserver variability of clinical experts highlights the potential of the proposed DL solution to automated TICI scoring.
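The abstract above reports "specific agreement" per score category without defining it. The sketch below assumes one common definition, in which the agreement for category i is twice the number of cases both raters assigned to i, divided by the total number of times either rater used i. This is an assumption and may differ from the definition the authors used; the function name and the example matrix are hypothetical.

```python
def specific_agreement(matrix):
    """Per-category specific agreement from a square confusion matrix.

    matrix[i][j] counts cases placed in category i by rater A and in
    category j by rater B. Returns one value per category.
    """
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("matrix must be square")
    result = []
    for i in range(size):
        row_total = sum(matrix[i])                           # rater A used i
        col_total = sum(matrix[r][i] for r in range(size))   # rater B used i
        denom = row_total + col_total
        # Undefined (NaN) when neither rater ever used the category.
        result.append(2 * matrix[i][i] / denom if denom else float("nan"))
    return result


# Hypothetical 2x2 table: rows are rater A, columns are rater B.
example = [[8, 2],
           [1, 9]]
print(specific_agreement(example))  # [16/19, 18/21]
```

Reporting a range of per-category values, as the abstract does, shows which categories the two raters confuse most, which a single kappa or overall accuracy figure hides.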

