Modeling the Effects of Training and Trainers on Rater Accuracy

2013 ◽  
Author(s):  
Kevin R. Raczynski ◽  
Allan S. Cohen ◽  
Zhenqiu L. Lu
2012 ◽  
Vol 44 (3) ◽  
pp. 149-159 ◽  
Author(s):  
Michael Wininger ◽  
Nam H. Kim ◽  
William Craelius

2020 ◽  
Vol 2020 (1) ◽  
pp. 1-15
Author(s):  
Yigal Attali

1962 ◽  
Vol 46 (3) ◽  
pp. 191-193 ◽  
Author(s):  
Cecil J. Mullins ◽  
Ronald C. Force

2020 ◽  
Vol 31 (1) ◽  
pp. 34-44
Author(s):  
Olivia Goodkin ◽  
Hugh G. Pemberton ◽  
Sjoerd B. Vos ◽  
Ferran Prados ◽  
Ravi K. Das ◽  
...  

Abstract

Objectives: Hippocampal sclerosis (HS) is a common cause of temporal lobe epilepsy. Neuroradiological practice relies on visual assessment, but quantification of HS imaging biomarkers (hippocampal volume loss and T2 elevation) could improve detection. We tested whether quantitative measures, contextualised with normative data, improve rater accuracy and confidence.

Methods: Quantitative reports (QReports) were generated for 43 individuals with epilepsy (mean age ± SD 40.0 ± 14.8 years, 22 men; 15 histologically confirmed unilateral HS; 5 bilateral; 23 MR-negative). Normative data were generated from 111 healthy individuals (age 40.0 ± 12.8 years, 52 men). Nine raters of varying experience (neuroradiologists, trainees, and image analysts) assessed the subjects' imaging with and without QReports, classifying each as normal, right HS, left HS, or bilateral HS. Confidence was rated on a 5-point scale.

Results: Correct designation (normal/abnormal) was high and showed a further trend-level improvement with QReports, from 87.5% to 92.5% (p = 0.07, effect size d = 0.69). The largest improvement (84.5% to 93.8%) was for image analysts (d = 0.87). For bilateral HS, QReports significantly improved overall accuracy, from 74.4% to 91.1% (p = 0.042, d = 0.7). Agreement with the correct diagnosis (kappa) tended to increase from 0.74 ('fair') to 0.86 ('excellent') with the report (p = 0.06, d = 0.81). Confidence increased when scans were correctly assessed with the QReport (p < 0.001, η2p = 0.945).

Conclusions: QReports of HS imaging biomarkers can improve rater accuracy and confidence, particularly in challenging bilateral cases. Improvements were seen across all raters, with large effect sizes, greatest for image analysts. These findings may have positive implications for clinical radiology services and justify further validation in larger groups.
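The kappa statistic reported in the results measures chance-corrected agreement between a rater's designations and the correct diagnoses. A minimal sketch of Cohen's kappa, using hypothetical designations rather than the study's data:

```python
from collections import Counter

def cohens_kappa(rater, truth):
    """Cohen's kappa: chance-corrected agreement between a rater's
    labels and the reference (correct) diagnoses."""
    n = len(rater)
    observed = sum(r == t for r, t in zip(rater, truth)) / n
    rc, tc = Counter(rater), Counter(truth)
    # Expected agreement by chance, from each label's marginal frequencies
    expected = sum(rc[label] * tc[label] for label in set(rc) | set(tc)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical designations (N = normal, L/R/B = left/right/bilateral HS)
truth = ["N", "L", "R", "B", "N", "L", "R", "B", "N", "L"]
rater = ["N", "L", "R", "B", "N", "L", "R", "N", "N", "R"]
print(round(cohens_kappa(rater, truth), 2))  # → 0.73
```

Because kappa subtracts out chance agreement, a rater who agrees 80% of the time (as above) can still score well below 0.8 when some labels dominate the sample.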
Key Points
• Quantification of imaging biomarkers for hippocampal sclerosis (volume loss and raised T2 signal) could improve clinical radiological detection in challenging cases.
• Quantitative reports for individual patients, contextualised with normative reference data, improved diagnostic accuracy and confidence in a group of nine raters, in particular for bilateral HS cases.
• We present a pre-use clinical validation of an automated imaging assessment tool to assist clinical radiology reporting of hippocampal sclerosis, which improves detection accuracy.
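Contextualising a patient's measurement against normative reference data, as the QReports do, can be illustrated as a simple z-score against the healthy-control distribution. The volumes and cut-off below are illustrative assumptions, not the authors' pipeline:

```python
def z_score(value, norm_mean, norm_sd):
    """Standardize a patient measurement against normative data."""
    return (value - norm_mean) / norm_sd

# Hypothetical normative left-hippocampal volume (mL) from healthy controls
NORM_MEAN, NORM_SD = 3.2, 0.35

patient_volume = 2.4  # hypothetical patient measurement (mL)
z = z_score(patient_volume, NORM_MEAN, NORM_SD)
print(round(z, 2))  # → -2.29
print(z < -1.96)    # flag as abnormal at an illustrative 2-SD cut-off → True
```

The same scheme extends to any biomarker with a normative distribution (e.g., hippocampal T2 signal), which is what lets a single report present several measures on a common, interpretable scale.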


2017 ◽  
Vol 23 (2) ◽  
pp. 157-176 ◽  
Author(s):  
Brian C. Wesolowski ◽  
Stefanie A. Wind

In any performance-based musical assessment context, construct-irrelevant variability attributed to raters is a concern when constructing a validity argument. Evidence of rater quality is therefore a necessary criterion for psychometrically sound (i.e., valid, reliable, and fair) rater-mediated music performance assessments. Rater accuracy is a rater quality index that measures the distance between raters' operational ratings and an expert's criterion ratings on a set of benchmark, exemplar, or anchor musical performances. The purpose of this study was to examine the quality of ratings in a secondary-level solo music performance assessment using a Multifaceted Rasch Rater Accuracy (MFR-RA) measurement model. Three research questions guided the study: (a) Overall, how accurate were the rater judgments in the assessment context? (b) How accurate were the rater judgments across each item of the rubric? (c) How accurate were the rater judgments across each domain of the rubric? Results indicated that accuracy scores generally matched the expectations of the MFR-RA model, with rater locations higher than the average student performance, item, and domain locations, indicating that the student performances, items, and domains were relatively easy to rate accurately for the sample of raters examined. Overall rater accuracy ranged from 0.54 logits (SE = 0.05) for the most accurate rater to 0.24 logits (SE = 0.04) for the least accurate rater. Difficulty of rater accuracy across items ranged from 0.91 logits (SE = 0.08) to -1.83 logits (SE = 0.17), and across domains from 0.25 logits (SE = 0.08) to -0.68 logits (SE = 0.17). Implications for improving music performance assessments, with specific regard to rater training, are discussed.
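In a Rasch rater-accuracy formulation like the MFR-RA model described above, each operational rating is first scored dichotomously against the expert criterion, and the probability of an accurate rating is then modeled from rater accuracy and element difficulty, both on a logit scale. A hedged sketch of both steps; the ratings are hypothetical, while the two accuracy values are the logit estimates quoted in the results:

```python
import math

def accuracy_scores(operational, criterion):
    """Dichotomize ratings: 1 if the rater matches the expert criterion, else 0."""
    return [int(o == c) for o, c in zip(operational, criterion)]

def rasch_accuracy_prob(rater_accuracy, difficulty):
    """Rasch model: probability that a rater with the given accuracy (logits)
    rates an element of the given difficulty (logits) accurately."""
    return 1.0 / (1.0 + math.exp(-(rater_accuracy - difficulty)))

# Hypothetical operational vs. expert criterion ratings on 5 performances
ops  = [3, 4, 2, 4, 5]
crit = [3, 4, 3, 4, 5]
print(accuracy_scores(ops, crit))  # → [1, 1, 0, 1, 1]

# Most vs. least accurate raters from the study (0.54 vs. 0.24 logits),
# each facing an element of average difficulty (0 logits):
print(round(rasch_accuracy_prob(0.54, 0.0), 2))  # → 0.63
print(round(rasch_accuracy_prob(0.24, 0.0), 2))  # → 0.56
```

Note how the logit scale translates into probabilities: the roughly 0.3-logit gap between the most and least accurate raters corresponds to only about a 7-point difference in the chance of an accurate rating on an average element.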


2009 ◽  
Vol 22 (5) ◽  
pp. 450-465 ◽  
Author(s):  
Deborah M. Powell ◽  
Richard D. Goffin
