Kappa Coefficients for Missing Data

2019 ◽  
Vol 79 (3) ◽  
pp. 558-576 ◽  
Author(s):  
Alexandra De Raadt ◽  
Matthijs J. Warrens ◽  
Roel J. Bosker ◽  
Henk A. L. Kiers

Cohen’s kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen’s kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing data mechanisms, namely, missingness completely at random and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error when missingness is completely at random, and small bias and mean squared error when missingness is not at random. In contrast, the kappa coefficient that treats missing ratings as a regular category was rather heavily biased and had a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient based on listwise deletion of units with missing ratings, whether missingness is completely at random or of the not-at-random form studied here.
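As a concrete illustration of the recommended listwise-deletion variant, a minimal sketch in Python (the function name and the NaN encoding of missing ratings are illustrative conventions, not the authors'):

```python
import numpy as np

def kappa_listwise(r1, r2):
    """Cohen's kappa after listwise deletion of units with a missing rating.

    r1, r2: 1-D arrays of numeric category labels; np.nan marks a missing rating.
    """
    r1, r2 = np.asarray(r1, dtype=float), np.asarray(r2, dtype=float)
    keep = ~(np.isnan(r1) | np.isnan(r2))        # drop units missing either rating
    a, b = r1[keep], r2[keep]
    cats = np.unique(np.concatenate([a, b]))
    # Joint proportion table of the two raters over the retained units
    P = np.array([[np.mean((a == i) & (b == j)) for j in cats] for i in cats])
    p_o = np.trace(P)                             # observed agreement
    p_e = P.sum(axis=1) @ P.sum(axis=0)           # chance agreement from the marginals
    return (p_o - p_e) / (1.0 - p_e)

# Example: the third unit has a missing second rating and is dropped.
print(kappa_listwise([1, 2, 1, 3, 2], [1, 2, np.nan, 3, 1]))
```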

2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Matthijs J. Warrens

Cohen’s kappa is a widely used association coefficient for summarizing interrater agreement on a nominal scale. Kappa reduces the ratings of the two observers to a single number. With three or more categories it is more informative to summarize the ratings by category coefficients that describe the information for each category separately. Examples of category coefficients are the sensitivity or specificity of a category or the Bloch-Kraemer weighted kappa. However, in many research studies one is interested only in a single overall number that roughly summarizes the agreement. It is shown that both the overall observed agreement and Cohen’s kappa are weighted averages of various category coefficients and thus can be used to summarize these category coefficients.
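To make the weighted-average property concrete, a small sketch (our own numerical illustration, not taken from the paper): for a joint proportion table P, the overall observed agreement equals the average of the category sensitivities p_ii/p_i+ weighted by the row marginals p_i+:

```python
import numpy as np

# Joint proportion table for two raters on a 3-category scale (rows: rater 1).
P = np.array([[0.30, 0.05, 0.05],
              [0.05, 0.25, 0.05],
              [0.00, 0.05, 0.20]])

row = P.sum(axis=1)          # row marginals p_i+
sens = np.diag(P) / row      # category sensitivities p_ii / p_i+
p_o = np.trace(P)            # overall observed agreement

# Observed agreement is the marginal-weighted average of the sensitivities.
assert np.isclose(p_o, np.sum(row * sens))
```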


Author(s):  
Julián Guzmán-Fierro ◽  
Sharel Charry ◽  
Ivan González ◽  
Felipe Peña-Heredia ◽  
Nathalie Hernández ◽  
...  

Abstract This paper presents a methodology based on Bayesian Networks (BN) to prioritise and select the minimal number of variables needed to predict the structural condition of sewer assets and thereby support proactive management strategies. The integration of BN models, a statistical measure of agreement (Cohen's kappa coefficient) and a statistical test (the Wilcoxon test) allowed a robust and straightforward selection of a minimum number of variables (qualitative and quantitative) that ensures a suitable prediction level for the structural condition of sewer pipes. Applying the methodology to a specific case study (Bogotá's sewer network, Colombia), it was found that with only two variables (age and diameter) the model could achieve the same predictive capacity (Cohen's kappa coefficient = 0.43) as a model considering several variables. Furthermore, the methodology allows finding the calibration and validation subsets that best fit the model (80% calibration and 20% validation data in the case study), increasing the predictive capacity with low variation. Finally, it was found that a model considering only pipes in critical and excellent condition increases the capacity for successful predictions (Cohen's kappa coefficient from 0.2 to 0.43) for the proposed case study.
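A hedged sketch of the comparison step described in the abstract, assuming a scikit-learn/SciPy workflow (a random forest stands in for the BN model here, and the data variables are placeholders, not the authors' pipeline):

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier   # stand-in for the BN model
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

def kappa_per_fold(X, y, n_splits=5, seed=0):
    """Cohen's kappa on each held-out fold for a given feature set."""
    scores = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        model = RandomForestClassifier(random_state=seed).fit(X[tr], y[tr])
        scores.append(cohen_kappa_score(y[te], model.predict(X[te])))
    return np.array(scores)

# X_full: all candidate variables; X_red: only age and diameter (placeholders).
# k_full = kappa_per_fold(X_full, y)
# k_red = kappa_per_fold(X_red, y)
# A non-significant Wilcoxon test suggests the reduced model predicts as well:
# stat, p = wilcoxon(k_full, k_red)
```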


2020 ◽  
pp. oemed-2020-106658
Author(s):  
Mahée Gilbert-Ouimet ◽  
Xavier Trudel ◽  
Karine Aubé ◽  
Ruth Ndjaboue ◽  
Caroline S Duchaine ◽  
...  

Objectives: This study assesses the validity of a self-reported mental health problem (MHP) diagnosis as the reason for a work absence of 5 days or more, compared with a physician-certified MHP diagnosis related to the same work absence. The potential modifying effect of absence duration on validity is also examined. Methods: A total of 709 participants (1031 sickness absence episodes) were selected and interviewed. Total per cent agreement, Cohen’s kappa, sensitivity and specificity values were calculated using the physician-certified MHP diagnosis related to a given work absence as the reference standard. Stratified analyses of total agreement, sensitivity and specificity values were also examined by duration of work absence (5–20 workdays, >20 workdays). Results: The total agreement value for self-reported MHP was 90%. The Cohen’s kappa value was substantial (0.74). Sensitivity was 77% and specificity was 95%. Absences of more than 20 workdays had better sensitivity than absences of shorter duration. High specificity was observed for both short and longer absence episodes. Conclusion: This study showed high specificity and good sensitivity of self-reported MHP diagnosis compared with physician-certified MHP diagnosis for the same work absence. Absences of longer duration had better sensitivity.
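For reference, validity measures of this kind can be computed from the 2×2 cross-classification of self-report against the physician-certified reference standard; a minimal sketch (variable names and the 0/1 coding are illustrative):

```python
import numpy as np

def validity_measures(self_report, reference):
    """Per cent agreement, Cohen's kappa, sensitivity, and specificity
    for binary arrays (1 = MHP diagnosis, 0 = other diagnosis)."""
    s, r = np.asarray(self_report), np.asarray(reference)
    tp = np.mean((s == 1) & (r == 1))
    tn = np.mean((s == 0) & (r == 0))
    agreement = tp + tn
    # Chance agreement from the marginal proportions of both sources
    p_e = np.mean(s) * np.mean(r) + (1 - np.mean(s)) * (1 - np.mean(r))
    kappa = (agreement - p_e) / (1 - p_e)
    sensitivity = tp / np.mean(r == 1)
    specificity = tn / np.mean(r == 0)
    return agreement, kappa, sensitivity, specificity
```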


2018 ◽  
Author(s):  
Cailey Elizabeth Fitzgerald ◽  
Ryne Estabrook ◽  
Daniel Patrick Martin ◽  
Andreas Markus Brandmaier ◽  
Timo von Oertzen

Missing data are ubiquitous in both small and large datasets. Missing data may arise from coding or computer error or participant absence, or may be intentional, as in planned missing designs. We discuss missing data as they relate to goodness-of-fit indices in Structural Equation Modeling (SEM), specifically the effects of missing data on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and thus overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of the RMSEA. As a consequence, the community may have been accepting a much larger fraction of models with non-acceptable fit. We propose a bias correction for the RMSEA based on information-theoretic considerations that takes into account the expected misfit of a person with fully observed data. This results in an RMSEA that is asymptotically independent of the proportion of missing data for misspecified models. Importantly, the corrected RMSEA is identical to the naive RMSEA when there are no missing data.
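For context, the conventional (naive) sample RMSEA is computed from the model chi-square statistic, its degrees of freedom, and the sample size; the minimal sketch below uses the common N−1 convention (software differs on N versus N−1) and does not reproduce the paper's bias correction:

```python
import numpy as np

def rmsea_naive(chi2, df, n):
    """Conventional sample RMSEA from the model chi-square statistic,
    its degrees of freedom df, and the sample size n (N-1 convention)."""
    return np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Example: a model with chi-square 85.3 on 40 degrees of freedom, n = 500
print(rmsea_naive(85.3, 40, 500))  # roughly 0.048
```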


2021 ◽  
Author(s):  
Yanjun LI ◽  
Xianglin Yang ◽  
Zhi Xu ◽  
Yu Zhang ◽  
Zhongping Cao

Abstract Sleep monitoring with polysomnography (PSG) severely degrades sleep quality. To simplify hygienic preparation and reduce the burden of sleep monitoring, an approach to automatic sleep stage classification without electroencephalogram (EEG) was explored. In total, 108 features from two-channel electrooculogram (EOG) and 6 features from one-channel electromyogram (EMG) were extracted. After feature normalization, a random forest (RF) was used to classify five stages: wakefulness, REM sleep, N1 sleep, N2 sleep and N3 sleep. Using 114 normalized features from the combination of EOG (108 features) and EMG (6 features), the Cohen’s kappa coefficient was 0.749 and the accuracy was 80.8% by leave-one-out cross-validation (LOOCV) for 124 records from ISRUC-Sleep. As a reference following the AASM standard, the Cohen’s kappa coefficient was 0.801 and the accuracy was 84.7% for the same dataset based on 438 normalized features from the combination of EEG (324 features), EOG (108 features) and EMG (6 features). In conclusion, the EOG+EMG approach with normalization can reduce the burden of sleep monitoring and achieves performance comparable to the "gold standard" EEG+EOG+EMG on sleep stage classification.
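A hedged sketch of the classification and evaluation step, assuming a scikit-learn workflow (the EOG/EMG feature extraction is not shown, and the normalization method used here, standardization, is an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler

def loocv_by_record(X, y, record_ids):
    """Leave-one-record-out evaluation of an RF sleep-stage classifier.

    X: (n_epochs, 114) EOG+EMG feature matrix; y: stage labels
    (W, REM, N1, N2, N3); record_ids: which recording each epoch belongs to.
    """
    y_pred = np.empty_like(y)
    for tr, te in LeaveOneGroupOut().split(X, y, groups=record_ids):
        scaler = StandardScaler().fit(X[tr])      # normalize on training folds only
        clf = RandomForestClassifier(n_estimators=300, random_state=0)
        clf.fit(scaler.transform(X[tr]), y[tr])
        y_pred[te] = clf.predict(scaler.transform(X[te]))
    return cohen_kappa_score(y, y_pred), accuracy_score(y, y_pred)
```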


ACI Open ◽  
2019 ◽  
Vol 03 (02) ◽  
pp. e88-e97
Author(s):  
Mohammadamin Tajgardoon ◽  
Malarkodi J. Samayamuthu ◽  
Luca Calzoni ◽  
Shyam Visweswaran

Abstract Background Machine learning models that are used for predicting clinical outcomes can be made more useful by augmenting predictions with simple and reliable patient-specific explanations for each prediction. Objectives This article evaluates the quality of explanations of predictions using physician reviewers. The predictions are obtained from a machine learning model that is developed to predict dire outcomes (severe complications including death) in patients with community-acquired pneumonia (CAP). Methods Using a dataset of patients diagnosed with CAP, we developed a predictive model to predict dire outcomes. On a set of 40 patients, who were predicted to be either at very high risk or at very low risk of developing a dire outcome, we applied an explanation method to generate patient-specific explanations. Three physician reviewers independently evaluated each explanatory feature in the context of the patient's data and were instructed to disagree with a feature if they did not agree with the magnitude of support, the direction of support (supportive versus contradictory), or both. Results The model used for generating predictions achieved an F1 score of 0.43 and an area under the receiver operating characteristic curve (AUROC) of 0.84 (95% confidence interval [CI]: 0.81–0.87). Interreviewer agreement was strong between two reviewers (Cohen's kappa coefficient = 0.87) and fair to moderate between the third reviewer and the others (Cohen's kappa coefficients = 0.49 and 0.33). Agreement rates between reviewers and generated explanations, defined as the proportion of explanatory features with which a majority of reviewers agreed, were 0.78 for actual explanations and 0.52 for fabricated explanations, and the difference between the two agreement rates was statistically significant (chi-square = 19.76, p-value < 0.01). Conclusion There was good agreement among physician reviewers on patient-specific explanations that were generated to augment predictions of clinical outcomes. Such explanations can be useful in interpreting predictions of clinical outcomes.
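The comparison of the two agreement rates can be reproduced with a standard chi-square test on a 2×2 table of agreed/disagreed feature counts; a sketch with placeholder counts chosen only to match the reported rates of 0.78 and 0.52 (the paper's actual feature counts are not given here):

```python
from scipy.stats import chi2_contingency

# Placeholder counts of explanatory features a majority of reviewers
# agreed/disagreed with, for actual vs. fabricated explanations.
table = [[78, 22],   # actual: agreed, disagreed
         [52, 48]]   # fabricated: agreed, disagreed
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value indicates the two agreement rates differ
```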


2021 ◽  
Vol 80 (Suppl 1) ◽  
pp. 983.2-983
Author(s):  
B. Drude ◽  
Ø. Maugesten ◽  
S. G. Werner ◽  
G. R. Burmester ◽  
J. Berger ◽  
...  

Background: Fluorescence Optical Imaging (FOI) utilises the fluorophore indocyanine green (ICG) to reflect enhanced microcirculation in hand and finger joints due to inflammation. Objectives: We wanted to assess the interreader reliability of FOI enhancement in patients with hand osteoarthritis (OA) and psoriatic arthritis (PsA). Furthermore, predefined typical morphologic patterns were included to determine the ability of FOI to discriminate between both diagnoses. Methods: An atlas with example images of grades 0-3 in different joint groups and typical morphologic patterns (‘streaky signals’[1], ‘green/blue nail sign’[2], ‘Werner sign’[3,4], and ‘Bishop’s crozier sign’) of PsA and hand OA was created. Two readers scored all joints in both hands (30 in total) of 20 cases with hand OA and PsA. The cases were randomly mixed and both readers were blinded to diagnosis. Each joint was rated on a semiquantitative scale from 0 to 3 in five different images (PrimaVista Mode (PVM); phase 1; phase 2, first and middle image; and phase 3) during the FOI sequence according to the scoring method FOIAS (fluorescence optical imaging activity score)[1,3]. Interreader reliability on scoring joint enhancement was calculated using linearly weighted Cohen’s kappa (κ). Agreement on diagnosis (hand OA vs. PsA) and on the different morphologic patterns was assessed by calculating (regular) Cohen’s kappa. Results: Overall agreement on scoring joint enhancement (all phases) was substantial (κ = 0.75), with the greatest consensus in phase 2 first (κ = 0.75) and the lowest agreement in phase 1 (κ = 0.46). Reliability varied across joint groups (wrist, MCP, (P)IP, DIP), with almost perfect overall agreement on PIP joint affection (κ = 0.81), substantial agreement on wrist (κ = 0.69) and DIP joint affection (κ = 0.63), and moderate agreement on MCP joint affection (κ = 0.49) across all phases. Consensus on morphologic patterns showed overall fair agreement (κ = 0.37), with a similar kappa value for the ability to discriminate between both diagnoses (κ = 0.3). Conclusion: Joint enhancement in FOI can be reliably assessed using a predefined scoring method. The ability of FOI to differentiate between hand OA and PsA seems to be limited. Clearer definitions and more training might be needed to better agree on morphologic patterns in FOI. References: [1] Glimm AM, Werner SG, Burmester GR, et al. Ann Rheum Dis. 2016 Mar;75(3):566-570. [2] Wiemann O, Werner SG, Langer HE, et al. J Dtsch Dermatol Ges. 2019 Feb;17(2):138-148. [3] Werner SG, Langer HE, Ohrndorf S, et al. Ann Rheum Dis. 2012 Apr;71(4):504-510. [4] Zeidler H. Fluoreszenzoptische Bildgebung. In: Zeidler H, Michel BA. Differenzialdiagnose rheumatischer Erkrankungen, 5th ed. Springer, Heidelberg, 2019, pp. 88-89. Disclosure of Interests: Benedict Drude: None declared, Øystein Maugesten: None declared, Stephanie Gabriele Werner: None declared, Gerd Rüdiger Burmester: None declared, Jörn Berger: Employee of Xiralite GmbH, Ida K. Haugen: None declared, Sarah Ohrndorf: None declared.
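For reference, a minimal sketch of the linearly weighted Cohen's kappa used for the 0-3 enhancement scores (our own implementation of the standard weighted-kappa formula, not the authors' code):

```python
import numpy as np

def linear_weighted_kappa(r1, r2, n_cats=4):
    """Linearly weighted Cohen's kappa for ordinal scores 0..n_cats-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    # Joint proportion table of the two readers
    P = np.array([[np.mean((r1 == i) & (r2 == j)) for j in range(n_cats)]
                  for i in range(n_cats)])
    # Linear agreement weights: 1 on the diagonal, decreasing with score distance
    idx = np.arange(n_cats)
    W = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (n_cats - 1)
    p_o = np.sum(W * P)                                        # weighted observed agreement
    p_e = np.sum(W * np.outer(P.sum(axis=1), P.sum(axis=0)))   # weighted chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```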


2012 ◽  
Vol 2012 ◽  
pp. 1-11
Author(s):  
Matthijs J. Warrens

Cohen’s kappa is a popular descriptive statistic for summarizing agreement between the classifications of two raters on a nominal scale. With m≥3 raters there are several views in the literature on how to define agreement. The concept of g-agreement (g∈{2,3,…,m}) refers to the situation in which it is decided that there is agreement if g out of m raters assign an object to the same category. Given m≥2 raters we can formulate m−1 multirater kappas, one based on 2-agreement, one based on 3-agreement, and so on, up to one based on m-agreement. It is shown that if the scale consists of only two categories the multirater kappas based on 2-agreement and 3-agreement are identical.
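A sketch of one common formulation of the g-agreement kappa, with observed and chance g-agreement averaged over all g-subsets of raters (the paper's exact definitions may differ in detail). For a binary scale the 2-agreement and 3-agreement versions coincide, which the final check illustrates:

```python
import numpy as np
from itertools import combinations

def g_agreement_kappa(ratings, g):
    """Multirater kappa based on g-agreement.

    ratings: (n_objects, m_raters) array of category labels. Observed and
    chance g-agreement are averaged over all g-subsets of the m raters.
    """
    n, m = ratings.shape
    cats = np.unique(ratings)
    # Marginal category proportions per rater (used for chance agreement)
    marg = np.array([[np.mean(ratings[:, j] == c) for c in cats] for j in range(m)])
    p_o, p_e = [], []
    for subset in combinations(range(m), g):
        sub = ratings[:, list(subset)]
        p_o.append(np.mean((sub == sub[:, [0]]).all(axis=1)))    # all g raters agree
        p_e.append(np.sum(np.prod(marg[list(subset)], axis=0)))  # same, by chance
    p_o, p_e = np.mean(p_o), np.mean(p_e)
    return (p_o - p_e) / (1 - p_e)

# On a two-category scale the 2- and 3-agreement kappas are identical.
rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(500, 4))        # 4 raters, binary ratings
print(g_agreement_kappa(R, 2), g_agreement_kappa(R, 3))
```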


2021 ◽  
Vol 9 ◽  
Author(s):  
Pellegrino Cerino ◽  
Alfonso Gallo ◽  
Biancamaria Pierri ◽  
Carlo Buonerba ◽  
Denise Di Concilio ◽  
...  

The emergence of the new SARS-CoV-2 coronavirus encouraged the development of new serologic tests that could be complementary to real-time RT-PCR-based assays. In such a context, studies of the performance of the available tests are urgently needed, as their use has just been initiated for seroprevalence assessment. The aim of this study was to compare four chemiluminescence immunoassays and one immunochromatography test for SARS-CoV-2 antibodies in order to evaluate the degree of diffusion of SARS-CoV-2 infection in Salerno Province (Campania Region, Italy). A total of 3,185 specimens from citizens were tested for anti-SARS-CoV-2 antibodies as part of a screening program. Four automated immunoassays (Abbott and Liaison SARS-CoV-2 CLIA IgG and Roche and Siemens SARS-CoV-2 CLIA IgM/IgG/IgA assays) and one lateral flow immunoassay (LFIA Technogenetics IgG–IgM COVID-19) were used. Seroprevalence in the entire cohort was 2.41, 2.10, 1.82, and 1.85% according to the Liaison IgG, Abbott IgG, Siemens, and Roche total Ig tests, respectively. When we explored the agreement between the rapid test and the serologic assays, we found good agreement for Abbott, Siemens, and Roche (Cohen's kappa coefficients 0.69, 0.67, and 0.67, respectively), whereas we found moderate agreement for Liaison (Cohen's kappa coefficient 0.58). Our study showed that the Abbott and Liaison SARS-CoV-2 CLIA IgG assays, the Roche and Siemens SARS-CoV-2 CLIA IgM/IgG/IgA assays, and the LFIA Technogenetics IgG-IgM COVID-19 test have good agreement in seroprevalence assessment. In addition, our findings indicate that the prevalence of IgG and total Ig antibodies against SARS-CoV-2 at the time of the study was as low as around 3%, likely explaining the amplitude of the current second wave.
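Agreement between assays of this kind is often summarized as a matrix of pairwise Cohen's kappas; a minimal sketch using scikit-learn (the data layout and assay names are illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(results):
    """Pairwise Cohen's kappa between binary test outcomes.

    results: dict mapping assay name -> array of 0/1 outcomes per specimen.
    """
    names = list(results)
    K = np.ones((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j:
                K[i, j] = K[j, i] = cohen_kappa_score(results[a], results[b])
    return names, K
```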

