Inter-Rater Reliability: Intraclass Correlation Coefficients

Degenerated discs have shorter T2-relaxation time and lower MR signal. The location of the signal-intensity-weighted-centroid reflects the water distribution within a region-of-interest (ROI). This study compared the reliability of the location of the signal-intensity-weighted-centroid with that of mean signal intensity and area measurements. L4-L5 and L5-S1 discs were measured on 43 mid-sagittal T2-weighted 3T MRI images in adults with back pain. One rater analysed images twice and another once, blinded to measurements. Discs were semi-automatically segmented into a whole disc, nucleus, anterior and posterior annulus. The coordinates of the signal-intensity-weighted-centroid for all regions demonstrated excellent intraclass-correlation-coefficients for intra- (0.99–1.00) and inter-rater reliability (0.97–1.00). The standard error of measurement for the Y-coordinates of the signal-intensity-weighted-centroid was 0 for all ROIs at both levels, and 0 to 2.7 mm for X-coordinates. The mean signal intensity and area for the whole disc and nucleus presented excellent intra-rater reliability with intraclass-correlation-coefficients from 0.93 to 1.00, and 0.92 to 1.00 for inter-rater reliability. The mean signal intensity and area had lower reliability for annulus ROIs, with intra-rater intraclass-correlation-coefficients from 0.5 to 0.76 and inter-rater from 0.33 to 0.58. The location of the signal-intensity-weighted-centroid is a reliable biomarker for investigating the effects of disc interventions.
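The abstract above does not state which form of the intraclass correlation coefficient was used. As a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-measure form often written ICC(2,1), the coefficient can be computed from ANOVA mean squares. The function name `icc_2_1` and the example scores are hypothetical and are not taken from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: list of rows, one row per subject, one score per rater.
    The choice of ICC form is an assumption; the source does not name one.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), raters (columns) and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subjects mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores: five subjects, each measured by two raters.
scores = [[10.1, 10.3], [12.0, 11.8], [9.5, 9.9], [14.2, 14.0], [11.1, 11.4]]
print(icc_2_1(scores))
```

Because this form penalises systematic offsets between raters, it is lower than a consistency-type ICC whenever one rater scores consistently higher than the other.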

Reliability and validity of the iSense optical scanner for measuring volume of transtibial residual limb models

Prosthetics and Orthotics International ◽

10.1177/0309364618806038 ◽

2018 ◽

Vol 43 (2) ◽

pp. 213-220 ◽

Cited By ~ 1

Author(s):

Lucy Armitage ◽

Li Khim Kwah ◽

Lauren Kark

Keyword(s):

Intraclass Correlation ◽

Reliability And Validity ◽

Correlation Coefficients ◽

Criterion Validity ◽

Residual Limb ◽

Limb Volume ◽

Rater Reliability ◽

Measuring Volume ◽

Intraclass Correlation Coefficients ◽

Optical Scanner

Background: Residual limb volume is often measured as part of routine care for people with amputations. These measurements assist in the timing of prosthetic fitting or replacement. In order to make well-informed decisions, clinicians need access to measurement tools that are valid and reliable. Objectives: To assess the reliability and criterion validity of the iSense optical scanner in measuring volume of transtibial residual limb models. Study Design: Three assessors performed two measurements each on 13 residual limb models with an iSense optical scanner (3D Systems, USA). Intra-rater and inter-rater reliability were calculated using intraclass correlation coefficients. Bland-Altman plots were inspected for agreement. Criterion validity was assessed using a steel rod of known dimensions. Ten repeated measurements were performed by one assessor. A t-test was used to determine differences between measured and true rod volume. Results: Intra-rater reliability was excellent (range of intraclass correlation coefficients: 0.991–0.997, all with narrow 95% confidence intervals). While the intraclass correlation coefficients suggest excellent inter-rater reliability between all three assessors (range of intraclass correlation coefficients: 0.952–0.986), the 95% confidence intervals were wide between assessor 3 and the other two assessors. Poor agreement with assessor 3 was also seen in the Bland-Altman plots. Criterion validity was very poor with a significant difference between the mean iSense measurement and the true rod volume (difference: 221.18 mL; p < 0.001). Conclusions: Although intra-rater reliability was excellent for the iSense scanner, we did not find similar results for inter-rater reliability and validity. These results suggest that further testing of the iSense scanner is required prior to use in clinical practice. Clinical relevance: The iSense offers a low-cost scanning option for residual limb volume measurement. Intra-rater reliability was excellent, but inter-rater reliability and validity were such that clinical adoption is not indicated at present.
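The abstract above reports high ICCs alongside poor agreement in Bland-Altman plots. A rough sketch of the quantities such a plot is built on (the mean difference, or bias, and the 95% limits of agreement) is given here. The function name `bland_altman` and the volumes are hypothetical, and the 1.96 multiplier assumes approximately normally distributed differences.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a and b.

    Limits of agreement are bias +/- 1.96 * SD of the paired differences,
    which assumes the differences are roughly normally distributed.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical limb-model volumes (mL) from two assessors.
assessor_1 = [812.0, 905.5, 770.2, 1010.8, 866.1]
assessor_2 = [820.4, 899.0, 781.9, 1032.5, 860.7]
bias, lower, upper = bland_altman(assessor_1, assessor_2)
print(f"bias={bias:.1f} mL, limits of agreement=({lower:.1f}, {upper:.1f}) mL")
```

Unlike the ICC, these limits are expressed in the measurement's own units, which is one reason the two approaches can lead to different conclusions about the same raters.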

Reliability of Autism-Tics, AD/HD, and other Comorbidities (A–TAC) Inventory in a Test-Retest Design

Psychological Reports ◽

10.2466/03.15.pr0.114k10w1 ◽

2014 ◽

Vol 114 (1) ◽

pp. 93-103 ◽

Cited By ~ 15

Author(s):

Tomas Larson ◽

Eva Norén Selinus ◽

Clara Hellner Gumpert ◽

Thomas Nilsson ◽

Nóra Kerekes ◽

...

Keyword(s):

Intraclass Correlation ◽

Correlation Coefficients ◽

Population Based ◽

Autism Spectrum ◽

Good Test ◽

Rater Reliability ◽

Retest Reliability ◽

Intraclass Correlation Coefficients ◽

Intraclass Correlations ◽

Test Retest Reliability

The Autism-Tics, AD/HD, and other Comorbidities (A–TAC) inventory is used in epidemiological research to assess neurodevelopmental problems and coexisting conditions. Although the A–TAC has been applied in various populations, data on retest reliability are limited. The objective of the present study was to present additional reliability data. The A–TAC was administered by lay assessors and was completed on two occasions by parents of 400 individual twins, with an average interval of 70 days between test sessions. Intra- and inter-rater reliability were analysed with intraclass correlations and Cohen's κ. A–TAC showed excellent test-retest intraclass correlations for both autism spectrum disorder and attention deficit hyperactivity disorder (each at .84). Most modules in the A–TAC had intra- and inter-rater reliability intraclass correlation coefficients of ≥ .60. Cohen's κ indicated acceptable reliability. The current study provides statistical evidence that the A–TAC yields good test-retest reliability in a population-based cohort of children.
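For categorical outcomes, the abstract above reports Cohen's κ alongside the intraclass correlations. As an illustrative sketch only, unweighted Cohen's κ for two sets of ratings can be computed from observed and chance-expected agreement. The function name `cohens_kappa` and the example labels are hypothetical, and the study may have used a different variant.

```python
from collections import Counter


def cohens_kappa(r1, r2):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    n = len(r1)
    # Observed proportion of items on which the two ratings agree.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Agreement expected by chance, from each rating's marginal frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical screen-positive (1) / screen-negative (0) calls on two occasions.
first = [1, 1, 0, 0, 1, 0, 0, 0]
second = [1, 0, 0, 0, 1, 0, 1, 0]
print(cohens_kappa(first, second))
```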

Reliability of assessment of medical students’ non-technical skills using a behavioural marker system: does clinical experience matter?

BMJ Simulation and Technology Enhanced Learning ◽

10.1136/bmjstel-2020-000705 ◽

2020 ◽

pp. bmjstel-2020-000705

Author(s):

Benjamin Clarke ◽

Samantha E Smith ◽

Emma Claire Phillips ◽

Ailsa Hamilton ◽

Joanne Kerins ◽

...

Keyword(s):

Medical Students ◽

Clinical Experience ◽

Intraclass Correlation ◽

Correlation Coefficients ◽

Technical Skills ◽

Rater Reliability ◽

Single Measure ◽

Marker System ◽

Intraclass Correlation Coefficients ◽

Reliability Coefficients

Introduction: Non-technical skills are recognised to play an integral part in safe and effective patient care. Medi-StuNTS (Medical Students’ Non-Technical Skills) is a behavioural marker system developed to enable assessment of medical students’ non-technical skills. This study aimed to assess whether newly trained raters with high levels of clinical experience could achieve reliability coefficients of >0.7 and to compare differences in inter-rater reliability of raters with varying clinical experience. Methods: Forty-four raters attended a workshop on Medi-StuNTS before independently rating three videos of medical students participating in immersive simulation scenarios. Data were grouped by raters’ levels of clinical experience. Inter-rater reliability was assessed by calculating intraclass correlation coefficients (ICC). Results: Eleven raters with more than 10 years of clinical experience achieved single-measure ICC of 0.37 and average-measures ICC of 0.87. Fourteen raters with more than or equal to 5 years and less than 10 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.59. Nineteen raters with less than 5 years of clinical experience achieved single-measure ICC of 0.09 and average-measures ICC of 0.65. Conclusions: Using 11 newly trained raters with high levels of clinical experience produced highly reliable ratings that surpassed the prespecified inter-rater reliability standard; however, a single rater from this group would not achieve sufficiently reliable ratings. This is consistent with previous studies using other medical behavioural marker systems. This study demonstrated a decrease in inter-rater reliability of raters with lower levels of clinical experience, suggesting caution when using this population as raters for assessment of non-technical skills.
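The gap between the single-measure and average-measures ICCs reported above follows from the number of raters being averaged. As a sketch, assuming the two coefficients come from the same ICC model, they are linked by the Spearman-Brown formula, ICC_avg = k·ICC_single / (1 + (k − 1)·ICC_single), where k is the number of raters. The function name `average_measures_icc` is hypothetical; the inputs are the rounded values quoted in the abstract.

```python
def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# (number of raters, single-measure ICC) for the three experience groups above.
for k, single in [(11, 0.37), (14, 0.09), (19, 0.09)]:
    print(k, round(average_measures_icc(single, k), 2))
# prints:
# 11 0.87
# 14 0.58
# 19 0.65
```

The first and third results match the reported average-measures ICCs. The middle one (0.58 against a reported 0.59) is plausibly explained by the single-measure value having been rounded to two decimals before it was published.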

Vein Measurement by Peripherally Inserted Central Catheter Nurses Using Ultrasound: A Reliability Study

Journal of the Association for Vascular Access ◽

10.1016/j.java.2013.08.001 ◽

2013 ◽

Vol 18 (4) ◽

pp. 234-238 ◽

Cited By ~ 8

Author(s):

Rebecca Sharp ◽

Andrea Gordon ◽

Antonina Mikocka-Walus ◽

Jessie Childs ◽

Carol Grech ◽

...

Keyword(s):

Intraclass Correlation ◽

Correlation Coefficients ◽

Cephalic Vein ◽

Basilic Vein ◽

Rater Reliability ◽

Vein Thrombosis ◽

Intraclass Correlation Coefficients ◽

Measurement Protocol ◽

Brachial Vein ◽

Deep Vein

Background: Peripherally inserted central catheters (PICCs) are increasingly inserted by trained registered nurses, necessitating the development of specialized skills such as the use of ultrasound. The selection of an adequately sized vein is an important factor in reducing adverse events such as deep vein thrombosis. However, PICC nurses may receive minimal training in the use of ultrasound for vein measurement. Objective: We aimed to demonstrate the reliability of a vein measurement protocol using ultrasound by a PICC nurse trained in sonography. Methods: The diameters of the basilic, brachial, and cephalic veins in the left arms of healthy participants (n = 12) were measured using ultrasound by a PICC nurse and a sonographer. The PICC nurse performed the measurement twice and the sonographer once; the PICC nurse's results were compared for intra-rater reliability and compared with the sonographer's for inter-rater reliability. The results were analyzed using intraclass correlation coefficients (ICCs). Results: Inter-rater reliability between the PICC nurse and the sonographer was adequate; the ICC for the brachial vein was 0.60 (95% confidence interval [CI], 0.06–0.87), basilic vein ICC was 0.87 (95% CI, 0.58–0.96) and cephalic vein ICC was 0.77 (95% CI, 0.39–0.93). Intra-rater reliability of the PICC nurse was higher; the ICC for the brachial vein was 0.80 (95% CI, 0.44–0.94), basilic vein ICC was 0.92 (95% CI, 0.67–0.98), and cephalic vein ICC was 0.78 (95% CI, 0.40–0.93). Conclusions: Using a suitable protocol, a PICC nurse was able to measure vein diameter reliably when compared with a sonographer and consistently replicate these results.

Intra and Inter-rater Reliability between Ultrasound Imaging and Caliper Measures to determine Spring Ligament Dimensions in Cadavers

Scientific Reports ◽

10.1038/s41598-019-51384-6 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Fernando Santiago-Nuño ◽

Patricia Palomo-López ◽

Ricardo Becerro-de-Bengoa-Vallejo ◽

César Calvo-Lobo ◽

Marta Elena Losa-Iglesias ◽

...

Keyword(s):

Ultrasound Imaging ◽

Intraclass Correlation ◽

Correlation Coefficients ◽

Absolute Accuracy ◽

Strong Correlations ◽

Perfect Agreement ◽

Rater Reliability ◽

Intraclass Correlation Coefficients ◽

Spring Ligament ◽

Good Repeatability

The purpose was to evaluate intra and inter-rater reliability, repeatability and absolute accuracy between ultrasound imaging (US) and caliper measures to determine Spring ligament (SL) dimensions in cadavers. SLs were identified from 62 human feet from formaldehyde-embalmed cadavers. Intra and inter-observer reliability, repeatability and absolute accuracy of SL width, thickness and length between US and caliper measurements were determined at intra and inter-session by intraclass correlation coefficients, Pearson's correlation coefficients, Student t tests, standard errors of measurement, minimum detectable changes, values of normality, 95% limits of agreement, and Bland-Altman plots. Excellent inter-session and inter-rater reliability, adequate absolute accuracy, almost perfect agreement and strong correlations were shown for caliper, US and their comparison for all SL dimensions. US measurements presented higher absolute accuracy than caliper measures for SL length and thickness dimensions, while caliper displayed greater absolute accuracy for SL width dimensions. Good repeatability (P > 0.05) was shown for all SL dimensions by US, caliper and their comparison, except for SL width dimension measured with US (P = 0.019). Both US and caliper could be recommended for all SL dimensions evaluation due to their excellent reliability and absolute accuracy in cadavers, although width dimensions should be considered with caution due to US repeatability differences.
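The abstract above lists standard errors of measurement (SEM) and minimum detectable changes (MDC) without giving formulas. A minimal sketch using the commonly cited definitions, SEM = SD·√(1 − ICC) and MDC95 = 1.96·√2·SEM, is shown here. That the authors used exactly these definitions is an assumption, and the function names and input values are hypothetical.

```python
import math


def sem(sd, icc):
    """Standard error of measurement from the sample SD and the ICC."""
    return sd * math.sqrt(1 - icc)


def mdc95(sem_value):
    """Minimum detectable change at 95% confidence for a test-retest difference."""
    # sqrt(2) accounts for measurement error being present in both sessions.
    return 1.96 * math.sqrt(2) * sem_value


# Hypothetical values: SD of 1.2 mm across specimens and an ICC of 0.95.
s = sem(1.2, 0.95)
print(f"SEM={s:.2f} mm, MDC95={mdc95(s):.2f} mm")
```

Both quantities are in the units of the measurement, so they complement the unitless ICC when judging whether an observed change exceeds measurement noise.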

Appraisal of a scoring instrument for training and testing neonatal intubation skills

Archives of Disease in Childhood - Fetal and Neonatal Edition ◽

10.1136/archdischild-2018-315221 ◽

2018 ◽

Vol 104 (5) ◽

pp. F521-F527 ◽

Cited By ~ 1

Author(s):

Romy N Bouwmeester ◽

Mathijs Binkhorst ◽

Nicole K Yamada ◽

Rosa Geurtzen ◽

Arno F J van Heijst ◽

...

Keyword(s):

Construct Validity ◽

Intraclass Correlation ◽

Correlation Coefficients ◽

Rater Reliability ◽

Patient Simulator ◽

Training Centre ◽

Intraclass Correlation Coefficients ◽

Tube Position ◽

The USA ◽

Neonatal Patient

Objective: To determine the validity, reliability, feasibility and applicability of a neonatal intubation scoring instrument. Design: Prospective observational study. Setting: Simulation-based research and training centre (Center for Advanced Pediatric and Perinatal Education), California, USA. Subjects: Forty clinicians qualified for neonatal intubation. Interventions: Videotaped elective intubations on a neonatal patient simulator were scored by two independent raters. One rater scored the intubations twice. We scored the preparation of equipment and premedication, intubation performance, tube position/fixation, communication, number of attempts, duration and successfulness of the procedure. Main outcome measures: Intraclass correlation coefficients (ICC) were calculated for intrarater and inter-rater reliability. Kappa coefficients for individual items and mean kappa coefficients for all items combined were calculated. Construct validity was assessed with one-way analysis of variance using the hypothesis that experienced clinicians score higher than less experienced clinicians. The approximate time to score one intubation and the instrument’s applicability in another setting were evaluated. Results: ICCs for intrarater and inter-rater reliability were 0.99 (95% CI 0.98 to 0.99) and 0.89 (95% CI 0.35 to 0.96), and mean kappa coefficients were 0.93 (95% CI 0.85 to 1.01) and 0.71 (95% CI 0.56 to 0.92), respectively. There were no differences between the more and less experienced clinicians regarding preparation, performance, communication and total scores. The experienced group scored higher only on tube position/fixation (p=0.02). Scoring one intubation took approximately 15 min. Our instrument, developed in The Netherlands, could be readily applied in the USA. Conclusions: Our scoring instrument for simulated neonatal intubations appears to be reliable, feasible and applicable in another centre. Construct validity could not be established.

Intra- and inter-rater reliability of the Behaviour Mapping Schedule: A direct observational tool for classifying children’s play behaviour

Australasian Journal of Early Childhood ◽

10.1177/1836939120982764 ◽

2021 ◽

pp. 183693912098276

Author(s):

Kylie A Dankiw ◽

Katherine L Baldock ◽

Saravana Kumar ◽

Margarita D Tsiros

Keyword(s):

Intraclass Correlation ◽

South Australia ◽

Correlation Coefficients ◽

Rater Reliability ◽

Children's Play ◽

Intraclass Correlation Coefficients ◽

Observational Tool ◽

Play Behaviour ◽

Training Resources

Identifying and describing children’s play behaviours is an important component of evaluating child development. The Behaviour Mapping Schedule is a direct observational tool which aims to describe and quantify children’s play behaviours but is yet to undergo reliability testing. This study aimed to determine the intra- and inter-rater reliability of the Behaviour Mapping Schedule. Twelve children aged 3–5 years were each video recorded for one 20-minute playtime period at a purposively selected Community Children’s Centre in Adelaide, South Australia. The video recordings were coded independently by two raters against 23 behaviour codes. Intraclass correlation coefficients (ICCs) were calculated. Intra-rater ICCs for nearly 70% of the behaviour codes were considered ‘excellent’; likewise, for inter-rater ICCs on more than 50% of the behaviour codes. Overall, the Behaviour Mapping Schedule is a reliable tool for observing children’s play behaviour; however, additional training resources may be useful to further strengthen inter-rater reliability.

Validation of the “Inflammatory Bowel Disease - Distribution, Chronicity, Activity (IBD-DCA) Score” for Ulcerative Colitis and Crohn's disease

Journal of Crohn's and Colitis ◽

10.1093/ecco-jcc/jjab055 ◽

2021 ◽

Author(s):

Corinna Lang-Schwarz ◽

Miriam Angeloni ◽

Abbas Agaimy ◽

Raja Atreya ◽

Christoph Becker ◽

...

Keyword(s):

Ulcerative Colitis ◽

Inflammatory Bowel Disease ◽

Disease Activity ◽

Bowel Disease ◽

Intraclass Correlation ◽

Correlation Coefficients ◽

Rater Reliability ◽

Intraclass Correlation Coefficients ◽

External Responsiveness ◽

Inflammatory Bowel

Background and aims: Histological scoring plays a key role in the assessment of disease activity in ulcerative colitis (UC) and is also important in Crohn's disease (CD). Currently, there is no common scoring available for UC and CD. We aimed to validate the Inflammatory Bowel Disease (IBD) – Distribution (D), Chronicity (C), Activity (A) score (IBD-DCA score) for histological disease activity assessment in IBD. Methods: Inter- and intra-rater reliability were assessed by 16 observers on biopsy specimens from 59 patients with UC and 25 patients with CD. Construct validity and responsiveness to treatment were retrospectively evaluated on a second cohort of 30 patients. Results: Inter-rater reliability was moderate to good for the UC cohort (intraclass correlation coefficients (ICCs) = 0.645, 0.623, 0.767 for D, C and A, respectively) and at best moderate for the CD cohort (ICC = 0.690, 0.303, 0.733 for D, C and A, respectively). Intra-rater agreement ranged from good to excellent in both cohorts. Correlation with the Nancy Histological Index (NHI) was moderate, and correlation with the Simplified Geboes Score (SGS) and a Visual Analog Scale (VAS) was strong. Large effect sizes (ES) were obtained for all three parameters. External responsiveness analysis revealed correlated changes between IBD-DCA score and NHI, SGS and VAS. Conclusions: The IBD-DCA score is a simple histological activity score for UC and CD, agreed and validated by a large group of IBD specialists. It provides reliable information on treatment response. Therefore, it has potential value for use in routine diagnostics as well as clinical studies.

Inter-rater reliability of two paediatric early warning score tools

Dansk Tidsskrift for Akutmedicin ◽

10.7146/akut.v2i3.112944 ◽

2019 ◽

Vol 2 (3) ◽

pp. 37

Author(s):

Claus Sixtus Jensen

Keyword(s):

Early Warning ◽

Intraclass Correlation ◽

Healthcare Providers ◽

Correlation Coefficients ◽

Assessment Tools ◽

Early Warning Score ◽

Rater Reliability ◽

Intraclass Correlation Coefficients ◽

Paediatric Early Warning Score ◽

The Individual

Background: Paediatric early warning score (PEWS) assessment tools can assist healthcare providers in the timely detection and recognition of subtle patient condition changes signalling clinical deterioration. However, PEWS instrument data are only as reliable and accurate as the caregivers who obtain and document the parameters. The aim of this study is to evaluate inter-rater reliability among nurses using PEWS systems. Method: The study was carried out in five paediatrics departments in the Central Denmark Region. Inter-rater reliability was investigated through parallel observations. A total of 108 children and 69 nurses participated. Two nurses simultaneously performed a PEWS assessment on the same patient. Before the assessment, the two participating nurses drew lots to decide who would be the active observer. Intraclass correlation coefficient, Fleiss’ κ and Bland–Altman limits of agreement were used to determine inter-rater reliability. Results: The intraclass correlation coefficients for the aggregated PEWS score of the two PEWS models were 0.98 and 0.95, respectively. The κ value on the individual PEWS measurements ranged from 0.70 to 1.0, indicating good to very good agreement. The nurses assigned the exact same aggregated score for both PEWS models in 76% of the cases. In 98% of the PEWS assessments, the aggregated PEWS scores assigned by the nurses were equal to or below 1 point in both models. Conclusion: The study showed good to very good inter-rater reliability in the two PEWS models used in the Central Denmark Region.
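The abstract above names Fleiss’ κ as the agreement statistic for the individual PEWS measurements. As a rough sketch, assuming each patient is scored by the same number of raters and each score falls into one of a fixed set of categories, Fleiss’ κ can be computed from a table of per-patient category counts. The function name `fleiss_kappa`, the category layout and the counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_subjects * n_raters

    # Overall proportion of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-subject agreement: share of rater pairs that agree on that subject.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical sub-scores (categories 0, 1, 2) given by two nurses to six children.
table = [[2, 0, 0], [2, 0, 0], [1, 1, 0], [0, 2, 0], [0, 0, 2], [0, 1, 1]]
print(fleiss_kappa(table))
```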