A Comparison of Reliability Coefficients for Ordinal Rating Scales

Author(s):  
Alexandra de Raadt ◽  
Matthijs J. Warrens ◽  
Roel J. Bosker ◽  
Henk A. L. Kiers

Abstract: Kappa coefficients are commonly used for quantifying reliability on a categorical scale, whereas correlation coefficients are commonly applied to assess reliability on an interval scale. Both types of coefficients can be used to assess the reliability of ordinal rating scales. In this study, we compare seven reliability coefficients for ordinal rating scales: the kappa coefficients included are Cohen’s kappa, linearly weighted kappa, and quadratically weighted kappa; the correlation coefficients included are intraclass correlation ICC(3,1), Pearson’s correlation, Spearman’s rho, and Kendall’s tau-b. The primary goal is to provide a thorough understanding of these coefficients such that the applied researcher can make a sensible choice for ordinal rating scales. A second aim is to find out whether the choice of the coefficient matters. We studied to what extent we reach the same conclusions about inter-rater reliability with different coefficients, and to what extent the coefficients measure agreement in a similar way, using analytic methods, and simulated and empirical data. Using analytical methods, it is shown that differences between quadratic kappa and the Pearson and intraclass correlations increase if agreement becomes larger. Differences between the three coefficients are generally small if differences between rater means and variances are small. Furthermore, using simulated and empirical data, it is shown that differences between all reliability coefficients tend to increase if agreement between the raters increases. Moreover, for the data in this study, the same conclusion about inter-rater reliability was reached in virtually all cases with the four correlation coefficients. In addition, using quadratically weighted kappa, we reached a similar conclusion as with any correlation coefficient a great number of times. Hence, for the data in this study, it does not really matter which of these five coefficients is used. Moreover, the four correlation coefficients and quadratically weighted kappa tend to measure agreement in a similar way: their values are very highly correlated for the data in this study.
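To make the three kappa coefficients named in the abstract concrete, the following is a minimal sketch of how they can be computed for two raters. It is not the authors' code: the function name `weighted_kappa`, the integer coding of categories as 0..n_categories-1, and the disagreement-weight formulation are assumptions chosen for illustration.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_categories, weights="identity"):
    """Kappa for two raters scoring the same subjects.

    Assumes ratings are integers in 0..n_categories-1 (illustrative coding).
    weights="identity"  -> Cohen's (unweighted) kappa
    weights="linear"    -> linearly weighted kappa
    weights="quadratic" -> quadratically weighted kappa
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of subjects")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    if weights not in ("identity", "linear", "quadratic"):
        raise ValueError("weights must be 'identity', 'linear' or 'quadratic'")

    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # counts per (category_a, category_b) cell
    marg_a = Counter(rater_a)               # category counts for rater A
    marg_b = Counter(rater_b)               # category counts for rater B

    def disagreement(i, j):
        # Disagreement weight: 0 on the diagonal, larger further from it.
        if weights == "identity":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (n_categories - 1)
        return distance if weights == "linear" else distance ** 2

    # Observed weighted disagreement and the value expected under independence.
    observed = sum(disagreement(i, j) * count / n for (i, j), count in joint.items())
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j] / (n * n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed / expected
```

With only two categories the three weighting schemes coincide; they differ once there are three or more ordered categories, because linear and quadratic weights penalise distant disagreements more than adjacent ones.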

2014 ◽  
Vol 114 (1) ◽  
pp. 93-103 ◽  
Author(s):  
Tomas Larson ◽  
Eva Norén Selinus ◽  
Clara Hellner Gumpert ◽  
Thomas Nilsson ◽  
Nóra Kerekes ◽  
...  

The Autism-Tics, AD/HD, and other Comorbidities (A–TAC) inventory is used in epidemiological research to assess neurodevelopmental problems and coexisting conditions. Although the A–TAC has been applied in various populations, data on retest reliability are limited. The objective of the present study was to present additional reliability data. The A–TAC was administered by lay assessors and was completed on two occasions by parents of 400 individual twins, with an average interval of 70 days between test sessions. Intra- and inter-rater reliability were analysed with intraclass correlations and Cohen's κ. A–TAC showed excellent test-retest intraclass correlations for both autism spectrum disorder and attention deficit hyperactivity disorder (each at .84). Most modules in the A–TAC had intra- and inter-rater reliability intraclass correlation coefficients of ≥ .60. Cohen's κ indicated acceptable reliability. The current study provides statistical evidence that the A–TAC yields good test-retest reliability in a population-based cohort of children.


2020 ◽  
pp. bmjstel-2020-000705
Author(s):  
Benjamin Clarke ◽  
Samantha E Smith ◽  
Emma Claire Phillips ◽  
Ailsa Hamilton ◽  
Joanne Kerins ◽  
...  

Introduction: Non-technical skills are recognised to play an integral part in safe and effective patient care. Medi-StuNTS (Medical Students’ Non-Technical Skills) is a behavioural marker system developed to enable assessment of medical students’ non-technical skills. This study aimed to assess whether newly trained raters with high levels of clinical experience could achieve reliability coefficients of >0.7 and to compare differences in inter-rater reliability of raters with varying clinical experience. Methods: Forty-four raters attended a workshop on Medi-StuNTS before independently rating three videos of medical students participating in immersive simulation scenarios. Data were grouped by raters’ levels of clinical experience. Inter-rater reliability was assessed by calculating intraclass correlation coefficients (ICC). Results: Eleven raters with more than 10 years of clinical experience achieved single-measure ICC of 0.37 and average-measures ICC of 0.87. Fourteen raters with more than or equal to 5 years and less than 10 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.59. Nineteen raters with less than 5 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.65. Conclusions: Using 11 newly trained raters with high levels of clinical experience produced highly reliable ratings that surpassed the prespecified inter-rater reliability standard; however, a single rater from this group would not achieve sufficiently reliable ratings. This is consistent with previous studies using other medical behavioural marker systems. This study demonstrated a decrease in inter-rater reliability of raters with lower levels of clinical experience, suggesting caution when using this population as raters for assessment of non-technical skills.
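The gap between the single-measure and average-measures ICCs reported above follows the Spearman-Brown relation, which links the reliability of one rater to the reliability of the mean of k raters. The sketch below is an illustration only: the abstract does not state which ICC model was used, and the function name `average_measures_icc` is an assumption.

```python
def average_measures_icc(single_measure_icc, n_raters):
    """Spearman-Brown step-up: reliability of the mean of n_raters ratings,
    given the reliability of a single rating."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * single_measure_icc / (1 + (n_raters - 1) * single_measure_icc)


# Rater group sizes and rounded single-measure ICCs as reported in the abstract.
for raters, single in [(11, 0.37), (14, 0.09), (19, 0.09)]:
    print(raters, single, round(average_measures_icc(single, raters), 2))
```

Because the inputs are the rounded single-measure values from the abstract, the results need not match the reported average-measures ICCs to the last digit. They do show why a panel of 11 raters can clear a 0.7 threshold when a single rater from the same group does not.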


Author(s):  
Danielle Fabiana Cucolo ◽  
Márcia Galan Perroca

ABSTRACT Objectives: to verify the reliability and construct validity estimates of the "Assessment of nursing care product" scale (APROCENF) and its applicability. Methods: this validation study included a sample of 40 (inter-rater reliability) and 172 (construct validity) assessments performed by nurses at the end of the work shift at nine inpatient services of a teaching hospital in the Brazilian Southeast. The data were collected between February and September/2014 with interruptions. Cronbach's alpha and Spearman's correlation coefficients were calculated, as well as the intraclass correlation and the weighted kappa index (inter-rater reliability). Exploratory factor analysis was used with principal component extraction and varimax rotation (construct validity). Results: the internal consistency revealed an alpha coefficient of 0.85, item-item correlation ranging between 0.13 and 0.61 and item-total correlation between 0.43 and 0.69. Inter-rater equivalence was obtained and all items evidenced significant factor loadings. Conclusion: this research evidenced the reliability and construct validity of the scale to assess the nursing care product. Its application in nursing practice permits identifying improvements needed in the production process, contributing to management and care decisions.
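For readers unfamiliar with the internal-consistency statistic reported above, Cronbach's alpha can be sketched as follows. This is a generic textbook formulation, not the study's analysis script; the function name `cronbach_alpha` and the data layout (one list of scores per item) are assumptions.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha.

    item_scores: list of items, each a list holding one score per respondent
    (hypothetical layout chosen for this sketch).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(item_scores[0])
    if n < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("every item needs the same number (>= 2) of respondents")

    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(statistics.variance(item) for item in item_scores)
    total_variance = statistics.variance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)
```

Alpha approaches 1 when items rise and fall together across respondents and approaches 0 when item scores are unrelated.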


2017 ◽  
Vol 10 (5) ◽  
pp. 462-466 ◽  
Author(s):  
Scott L Zuckerman ◽  
Nikita Lakomkin ◽  
Jordan A Magarik ◽  
Jan Vargas ◽  
Marcus Stephens ◽  
...  

Background: The angiographic evaluation of previously coiled aneurysms can be difficult yet remains critical for determining re-treatment. Objective: The main objective of this study was to determine the inter-rater reliability for both the Raymond Scale and per cent embolization among a group of neurointerventionalists evaluating previously embolized aneurysms. Methods: A panel of 15 neurointerventionalists examined 92 distinct cases of immediate post-coil embolization and 1 year post-embolization angiographs. Each case was presented four times throughout the study, along with alterations in demographics in order to evaluate intra-rater reliability. All respondents were asked to provide the per cent embolization (0–100%) and Raymond Scale grade (1-3) for each aneurysm. Inter-rater reliability was evaluated by computing weighted kappa values (for the Raymond Scale) and intraclass correlation coefficients (ICC) for per cent embolization. Results: 10 neurosurgeons and 5 interventional neuroradiologists evaluated 368 simulated cases. The agreement among all readers employing the Raymond Scale was fair (κ=0.35) while concordance in per cent embolization was good (ICC=0.64). Clinicians with fewer than 10 years of experience demonstrated a significantly greater level of agreement than the group with greater than 10 years (κ=0.39 and ICC=0.70 vs κ=0.28 and ICC=0.58). When the same aneurysm was presented multiple times, clinicians demonstrated excellent consistency when assessing per cent embolization (ICC=0.82), but moderate agreement when employing the Raymond classification (κ=0.58). Conclusions: Identifying the per cent embolization in previously coiled aneurysms resulted in good inter- and intra-rater agreement, regardless of years of experience. The strong agreement among providers employing per cent embolization may make it a valuable tool for embolization assessment in this patient population.


Nutrients ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 1163
Author(s):  
Suzana Shahar ◽  
Mohd Razif Shahril ◽  
Noraidatulakma Abdullah ◽  
Boekhtiar Borhanuddin ◽  
Mohd Arman Kamaruddin ◽  
...  

Measuring dietary intakes in a multi-ethnic and multicultural setting, such as Malaysia, remains a challenge due to its diversity. This study aims to develop and evaluate the relative validity of an interviewer-administered food frequency questionnaire (FFQ) in assessing the habitual dietary exposure of The Malaysian Cohort (TMC) participants. We developed a nutrient database (with 203 items) based on various food consumption tables, and 803 participants were involved in this study. The output of the FFQ was then validated against three-day 24-h dietary recalls (n = 64). We assessed the relative validity and its agreement using various methods, such as Spearman’s correlation, weighted kappa, intraclass correlation coefficient (ICC), and Bland–Altman analysis. Spearman’s correlation coefficient ranged from 0.24 (vitamin C) to 0.46 (carbohydrate), and almost all nutrients had correlation coefficients above 0.3, except for vitamin C and sodium. Intraclass correlation coefficients ranged from −0.01 (calcium) to 0.59 (carbohydrates), and weighted kappa exceeded 0.4 for 50% of nutrients. In short, TMC’s FFQ appears to have good relative validity for the assessment of nutrient intake among its participants, as compared to the three-day 24-h dietary recalls. However, estimates for iron, vitamin A, and vitamin C should be interpreted with caution.


2021 ◽  
Vol 12 ◽  
Author(s):  
Wei Xia ◽  
William Ho Cheung Li ◽  
Tingna Liang ◽  
Yuanhui Luo ◽  
Laurie Long Kwan Ho ◽  
...  

Objectives: This study conducted a linguistic and psychometric evaluation of the Chinese Counseling Competencies Scale-Revised (CCS-R). Methods: The Chinese CCS-R was created from the original English version using a standard forward-backward translation process. The psychometric properties of the Chinese CCS-R were examined in a cohort of 208 counselors-in-training by two independent raters. Fifty-three counselors-in-training were asked to undergo another counseling performance evaluation for the test-retest. The confirmatory factor analysis (CFA) was conducted for the Chinese CCS-R, followed by internal consistency, test-retest reliability, inter-rater reliability, convergent validity, and concurrent validity. Results: The results of the CFA supported the factorial validity of the Chinese CCS-R, with adequate construct replicability. The scale had a McDonald's omega of 0.876, and intraclass correlation coefficients of 0.63 and 0.90 for test-retest reliability and inter-rater reliability, respectively. Significantly positive correlations were observed between the Chinese CCS-R score and scores of performance checklist (Pearson's γ = 0.781), indicating a large convergent validity, and knowledge on drug abuse (Pearson's γ = 0.833), indicating a moderate concurrent validity. Conclusion: The results support that the Chinese CCS-R is a valid and reliable measure of the counseling competencies. Practice implication: The CCS-R provides trainers with a reliable tool to evaluate counseling students' competencies and to facilitate discussions with trainees about their areas for growth.


2019 ◽  
Author(s):  
Marco Bardus ◽  
Nathalie Awada ◽  
Lilian A Ghandour ◽  
Elie-Jacques Fares ◽  
Tarek Gherbal ◽  
...  

BACKGROUND With thousands of health apps in app stores globally, it is crucial to systemically and thoroughly evaluate the quality of these apps due to their potential influence on health decisions and outcomes. The Mobile App Rating Scale (MARS) is the only currently available tool that provides a comprehensive, multidimensional evaluation of app quality, which has been used to compare medical apps from American and European app stores in various areas, available in English, Italian, Spanish, and German. However, this tool is not available in Arabic. OBJECTIVE This study aimed to translate and adapt MARS to Arabic and validate the tool with a sample of health apps aimed at managing or preventing obesity and associated disorders. METHODS We followed a well-established and defined “universalist” process of cross-cultural adaptation using a mixed methods approach. Early translations of the tool, accompanied by confirmation of the contents by two rounds of separate discussions, were included and culminated in a final version, which was then back-translated into English. Two trained researchers piloted the MARS in Arabic (MARS-Ar) with a sample of 10 weight management apps obtained from Google Play and the App Store. Interrater reliability was established using intraclass correlation coefficients (ICCs). After reliability was ascertained, the two researchers independently evaluated a set of additional 56 apps. RESULTS MARS-Ar was highly aligned with the original English version. The ICCs for MARS-Ar (0.836, 95% CI 0.817-0.853) and MARS English (0.838, 95% CI 0.819-0.855) were good. The MARS-Ar subscales were highly correlated with the original counterparts (P<.001). The lowest correlation was observed in the area of usability (r=0.685), followed by aesthetics (r=0.827), information quality (r=0.854), engagement (r=0.894), and total app quality (r=0.897). Subjective quality was also highly correlated (r=0.820).
CONCLUSIONS MARS-Ar is a valid instrument to assess app quality among trained Arabic-speaking users of health and fitness apps. Researchers and public health professionals in the Arab world can use the overall MARS score and its subscales to reliably evaluate the quality of weight management apps. Further research is necessary to test the MARS-Ar on apps addressing various health issues, such as attention or anxiety prevention, or sexual and reproductive health.


1999 ◽  
Vol 8 (4) ◽  
pp. 254-261 ◽  
Author(s):  
J Powers ◽  
SJ Bennett

BACKGROUND: Dyspnea, or difficult breathing, is common in patients receiving mechanical ventilation; however, dyspnea is not routinely or systematically measured. OBJECTIVE: The primary purpose of this methodological study was to evaluate the test-retest reliability of 5 dyspnea rating scales and the criterion validity of 4 dyspnea rating scales in patients receiving mechanical ventilation. The secondary purpose was to examine the correlations between each of these 5 rating scales and physiological measures of respiratory function. METHODS: The convenience sample consisted of 28 patients on mechanical ventilation during their hospitalization in the intensive care units of a large, inner-city hospital. Patients rated their dyspnea twice at 30-minute intervals on the visual analogue scale, the vertical analogue dyspnea scale, the modified Borg scale, the numerical scale, and the faces scale. Test-retest reliability was computed by using the intraclass correlation coefficient. Criterion validity was evaluated by using the Spearman rank-order correlation coefficient. RESULTS: The 5 rating scales had acceptable test-retest reliabilities, with intraclass correlation coefficients ranging from 0.81 to 0.97. Criterion validity of the 4 scales also was acceptable, with Spearman rank-order correlation coefficients from 0.76 to 0.96. The rating scales were not correlated with most of the physiological variables. At least half of the patients reported moderate to severe dyspnea. CONCLUSION: The scales showed acceptable reliability and validity, and they will be useful in quantifying dyspnea experienced by patients receiving mechanical ventilation. Further work is needed to evaluate the extent and the severity of dyspnea in such patients in order to evaluate the effectiveness of interventions.


2018 ◽  
Vol 63 (4) ◽  
pp. 453-460 ◽  
Author(s):  
Vahid Abdollah ◽  
Eric C. Parent ◽  
Michele C. Battié

Abstract Degenerated discs have shorter T2-relaxation time and lower MR signal. The location of the signal-intensity-weighted-centroid reflects the water distribution within a region-of-interest (ROI). This study compared the reliability of the location of the signal-intensity-weighted-centroid to mean signal intensity and area measurements. L4-L5 and L5-S1 discs were measured on 43 mid-sagittal T2-weighted 3T MRI images in adults with back pain. One rater analysed images twice and another once, blinded to measurements. Discs were semi-automatically segmented into a whole disc, nucleus, anterior and posterior annulus. The coordinates of the signal-intensity-weighted-centroid for all regions demonstrated excellent intraclass-correlation-coefficients for intra- (0.99–1.00) and inter-rater reliability (0.97–1.00). The standard error of measurement for the Y-coordinates of the signal-intensity-weighted-centroid for all ROIs was 0 at both levels and ranged from 0 to 2.7 mm for X-coordinates. The mean signal intensity and area for the whole disc and nucleus presented excellent intra-rater reliability with intraclass-correlation-coefficients from 0.93 to 1.00, and 0.92 to 1.00 for inter-rater reliability. The mean signal intensity and area had lower reliability for annulus ROIs, with intra-rater intraclass-correlation-coefficients from 0.5 to 0.76 and inter-rater from 0.33 to 0.58. The location of the signal-intensity-weighted-centroid is a reliable biomarker for investigating the effects of disc interventions.
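As an illustration of how an intraclass correlation of the kind reported above can be obtained from a subjects-by-raters table, here is a minimal sketch of the consistency form ICC(3,1) computed from two-way ANOVA mean squares. The abstract does not say which ICC form the study used, so the choice of ICC(3,1), the function name `icc_3_1`, and the data layout are assumptions.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rating.

    ratings: list of rows, one row per subject, one column per rater
    (hypothetical layout chosen for this sketch).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and >= 2 raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into subjects, raters and residual.
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_residual = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_subjects - ms_residual) / (ms_subjects + (k - 1) * ms_residual)
```

Because ICC(3,1) measures consistency rather than absolute agreement, a rater who scores every subject a fixed amount higher than another rater still yields a coefficient of 1.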


2019 ◽  
Vol 91 (1) ◽  
pp. 75-81 ◽  
Author(s):  
Leonhard A Bakker ◽  
Carin D Schröder ◽  
Harold H G Tan ◽  
Simone M A G Vugts ◽  
Ruben P A van Eijk ◽  
...  

Objective: The Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is widely applied to assess disease severity and progression in patients with motor neuron disease (MND). The objective of the study is to assess the inter-rater and intra-rater reproducibility, i.e., the inter-rater and intra-rater reliability and agreement, of a self-administration version of the ALSFRS-R for use in apps, online platforms, clinical care and trials. Methods: The self-administration version of the ALSFRS-R was developed based on both patient and expert feedback. To assess the inter-rater reproducibility, 59 patients with MND filled out the ALSFRS-R online and were subsequently assessed on the ALSFRS-R by three raters. To assess the intra-rater reproducibility, patients were invited on two occasions to complete the ALSFRS-R online. Reliability was assessed with intraclass correlation coefficients, agreement was assessed with Bland-Altman plots and paired samples t-tests, and internal consistency was examined with Cronbach’s coefficient alpha. Results: The self-administration version of the ALSFRS-R demonstrated excellent inter-rater and intra-rater reliability. The assessment of inter-rater agreement demonstrated small systematic differences between patients and raters and acceptable limits of agreement. The assessment of intra-rater agreement demonstrated no systematic changes between time points; limits of agreement were 4.3 points for the total score and ranged from 1.6 to 2.4 points for the domain scores. Coefficient alpha values were acceptable. Discussion: The self-administration version of the ALSFRS-R demonstrates high reproducibility and can be used in apps and online portals for both individual comparisons, facilitating the management of clinical care and group comparisons in clinical trials.
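The limits of agreement mentioned above come from a Bland-Altman analysis of paired measurements. A minimal sketch follows, assuming the conventional mean difference plus or minus 1.96 standard deviations; the abstract does not state the multiplier the authors used, and the function name `bland_altman_limits` is hypothetical.

```python
import statistics


def bland_altman_limits(first, second, multiplier=1.96):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the paired differences; the limits are
    bias -/+ multiplier * SD of the differences (1.96 is the usual convention
    and assumes roughly normally distributed differences).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(differences)
    spread = statistics.stdev(differences)  # sample SD of the differences
    return bias, bias - multiplier * spread, bias + multiplier * spread
```

A bias near zero indicates no systematic difference between the two occasions or raters, while the width of the limits shows how far an individual pair of scores can be expected to differ.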

