Exploring decision consistency and decision accuracy across rating designs in rater-mediated music performance assessments
Music performance assessments frequently involve panels of raters who evaluate the quality of musical performances using rating scales. For practical reasons, it is often not possible to obtain ratings from every rater on every performance (i.e., a complete rating design). When raters differ in severity and not all raters rate all performances, ratings of musical performances and the resulting classification decisions (e.g., pass or fail) depend on the “luck of the rater draw.” In this study, we explored the implications of different types of incomplete rating designs for the classification of musical performances in rater-mediated music performance assessments. We present a procedure that researchers and practitioners can use to adjust student scores for differences in rater severity when incomplete rating designs are used, and we consider the effects of this adjustment procedure across different types of rating designs. Our results suggest that differences in rater severity have large practical consequences for ratings of musical performances, and that these consequences affect individual students and groups of students differently. Furthermore, our findings suggest that it is possible to adjust musical performance ratings for differences in rater severity as long as there are common raters across scoring panels. We discuss the implications of our findings for music assessment research and practice.
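The abstract does not detail the adjustment procedure itself. As a minimal illustrative sketch only (not the authors' method, which would typically rest on a measurement model such as a many-facet Rasch model), the core idea of using common raters to link scoring panels can be expressed with a simple additive model, rating ≈ performance quality + rater leniency, fit by alternating averages. All identifiers below (adjust_for_severity, the rater and performance labels) are hypothetical, and the sketch assumes the panels are connected by at least one shared rater.

```python
# A minimal sketch of severity adjustment via common raters,
# assuming the additive model: rating = quality[p] + leniency[r] + error.
# The model is identified by centering rater leniencies at zero; a
# connected design (common raters across panels) is required.

from collections import defaultdict

def adjust_for_severity(ratings, n_iter=100):
    """ratings: list of (performance_id, rater_id, score) tuples."""
    quality = defaultdict(float)   # estimated performance quality
    leniency = defaultdict(float)  # positive = lenient, negative = severe
    for _ in range(n_iter):
        # Re-estimate each performance's quality from leniency-adjusted scores.
        by_perf = defaultdict(list)
        for p, r, s in ratings:
            by_perf[p].append(s - leniency[r])
        for p, vals in by_perf.items():
            quality[p] = sum(vals) / len(vals)
        # Re-estimate each rater's leniency from the residuals.
        by_rater = defaultdict(list)
        for p, r, s in ratings:
            by_rater[r].append(s - quality[p])
        for r, vals in by_rater.items():
            leniency[r] = sum(vals) / len(vals)
        # Center leniencies so the solution is identified.
        mean_len = sum(leniency.values()) / len(leniency)
        for r in leniency:
            leniency[r] -= mean_len
    return dict(quality), dict(leniency)

# Hypothetical data: two panels linked by common rater "B".
ratings = [
    ("p1", "A", 3), ("p1", "B", 4),   # panel 1 rates p1, p2
    ("p2", "A", 2), ("p2", "B", 3),
    ("p3", "B", 5), ("p3", "C", 3),   # panel 2 rates p3, p4
    ("p4", "B", 4), ("p4", "C", 2),
]
quality, leniency = adjust_for_severity(ratings)
# Severe raters (here "C") receive negative leniency, so the adjusted
# quality estimates for the performances they rated rise accordingly.
```

Because the two panels share rater B, the procedure can separate rater effects from performance quality; without that overlap, severity differences between panels would be confounded with the performances assigned to each panel, which is why the study conditions the adjustment on common raters across scoring panels.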