Interjudge Reliability and Ratings Variability in Music Performance Assessment

Author(s):  
Daniel Massoth

When technology is used for assessment in music, certain considerations can affect the validity, reliability, and depth of analysis. This chapter explores factors present in the three phases of the assessment process: recognition, analysis, and display of a musical performance assessment. Each phase has inherent challenges embedded in internal and external factors. The goal here is not to provide an exhaustive analysis of every aspect of assessment but rather to present the rationale for and history of using technology in music assessment and to examine the attendant philosophical and practical considerations. A discussion of possible future directions for product research and development concludes the chapter.


2009 · Vol. 57 (1) · pp. 5-15
Author(s):  
Charles R. Ciorba
Neal Y. Smith

Recent policy initiatives instituted by major accrediting bodies require the implementation of specific assessment tools to provide evidence of student achievement in a number of areas, including applied music study. The purpose of this research was to investigate the effectiveness of a multidimensional assessment rubric, which was administered to all students performing instrumental and vocal juries at a private Midwestern university during one semester (N = 359). Interjudge reliability coefficients indicated a moderate to high level of agreement among judges. Results also revealed that performance achievement was positively related to participants' year in school (freshman, sophomore, junior, and senior), which indicates that a multidimensional assessment rubric can effectively measure students' achievement in solo music performance.


2003 · Vol. 51 (2) · pp. 137-150
Author(s):  
Martin J. Bergee

Assessment of music performance in authentic contexts remains an underinvestigated area of research. This study examined one such context: the interjudge reliability of faculty evaluation of end-of-semester applied music performances. Brass (n = 4), percussion (n = 2), woodwind (n = 5), voice (n = 5), piano (n = 3), and string (n = 5) instructors evaluating a recent semester's applied music juries at a large university participated in the study. Each evaluator completed a criterion-specific rating scale for each performer and assigned each performance a global letter grade not shared with other evaluators or with the performer. Interjudge reliability was determined for each group's rating scale total scores, subscale scores, and letter-grade assessment. All possible panels of two, three, and four evaluators were also examined for interjudge reliability, and averaged correlations, standard deviations, and ranges were determined. Full-panel interjudge reliability was consistently good regardless of panel size. All total score reliability coefficients were statistically significant, as were all coefficients for the global letter-grade assessment. All subscale reliabilities for all groups were statistically significant, except in percussion (which, with an n of 2, had a stringent significance criterion) and for the Suitability subscale in voice. For larger panels (ns of 4 and 5), rating scale total score reliability was consistently, though not greatly, higher than reliability for the letter-grade assessment. Average reliability did not decrease as panel size incrementally decreased; panels of two and three evaluators, however, tended on average to exhibit more variability, greater range, and less uniformity than panels of four and five. No differences in reliability were noted among levels of experience or between teaching assistants and faculty members. Use of a minimum of five adjudicators for performance evaluation in this context was recommended.
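
To make the panel-size analysis concrete, here is a minimal sketch, in Python with invented ratings data, of the kind of computation the abstract describes: enumerate every judge panel of a given size, estimate interjudge reliability for each panel, and summarize the mean, standard deviation, and range across panels. The ratings matrix, noise levels, and the use of average pairwise Pearson correlation as the reliability index are illustrative assumptions, not the study's actual data or method.

```python
# Illustrative sketch only: invented data, with average pairwise Pearson
# correlation standing in for the study's reliability index.
import itertools
import numpy as np

def mean_pairwise_r(panel_ratings: np.ndarray) -> float:
    """Average pairwise Pearson correlation among judges (columns)."""
    corr = np.corrcoef(panel_ratings, rowvar=False)
    return corr[np.triu_indices_from(corr, k=1)].mean()

def panel_reliabilities(ratings: np.ndarray, panel_size: int) -> np.ndarray:
    """Reliability estimate for every possible judge panel of a given size."""
    n_judges = ratings.shape[1]
    return np.array([
        mean_pairwise_r(ratings[:, list(panel)])
        for panel in itertools.combinations(range(n_judges), panel_size)
    ])

# Hypothetical jury: 30 performers rated by 5 judges, with each score built
# from a shared "true quality" signal plus judge-specific noise.
rng = np.random.default_rng(0)
true_quality = rng.normal(80.0, 8.0, size=30)
ratings = true_quality[:, None] + rng.normal(0.0, 6.0, size=(30, 5))

for size in (2, 3, 4):
    r = panel_reliabilities(ratings, size)
    print(f"panels of {size}: mean r = {r.mean():.2f}, "
          f"sd = {r.std(ddof=1):.2f}, range = [{r.min():.2f}, {r.max():.2f}]")
```

Run on data like this, smaller panels show wider spread across panels even when the mean changes little, which is the pattern the abstract reports for panels of two and three evaluators.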


2007 · Vol. 55 (4) · pp. 344-358
Author(s):  
Martin J. Bergee

This study examined performer, rater, occasion, and sequence as sources of variability in music performance assessment. Generalizability theory served as the study's basis. Performers were 8 high school wind instrumentalists who had recently performed a solo. The author audio-recorded performers playing excerpts from their solo three times, establishing an occasion variable. To establish a rater variable, 10 certified adjudicators were asked to rate the performances from 0 (poor) to 100 (excellent). Raters were randomly assigned to one of five performance sequences, thus nesting raters within a sequence variable. Two G (generalizability) studies established that occasion and sequence produced virtually no measurement error. Raters were a strong source of error. D (decision) studies established the one-rater, one-occasion scenario as unreliable. In scenarios using the generalizability coefficient as a criterion, 5 hypothetical raters were necessary to reach the .80 benchmark. Using the dependability index, 17 hypothetical raters were necessary to reach .80.
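
The arithmetic behind those D-study projections is compact enough to sketch. The following shows, for a performer-by-rater design, how the generalizability coefficient (relative decisions) and the dependability index (absolute decisions) grow with the number of hypothetical raters. The variance components below are invented, chosen only so the .80 crossings land near the abstract's reported 5 and 17 raters; they are not the study's estimates.

```python
# Illustrative D-study arithmetic for a performer-x-rater design.
# All variance components are assumed values, tuned so the .80 thresholds
# fall near the rater counts reported in the abstract.
VAR_PERFORMER = 100.0  # universe-score (performer) variance -- assumed
VAR_RATER = 300.0      # rater main effect (leniency/severity) -- assumed
VAR_RESIDUAL = 120.0   # performer-by-rater interaction + residual -- assumed

def g_coefficient(n_raters: int) -> float:
    """E-rho^2: relative error ignores the rater main effect."""
    relative_error = VAR_RESIDUAL / n_raters
    return VAR_PERFORMER / (VAR_PERFORMER + relative_error)

def dependability_index(n_raters: int) -> float:
    """Phi: absolute error also charges the rater main effect."""
    absolute_error = (VAR_RATER + VAR_RESIDUAL) / n_raters
    return VAR_PERFORMER / (VAR_PERFORMER + absolute_error)

for n in (1, 5, 17):
    print(f"{n:2d} raters: E-rho^2 = {g_coefficient(n):.3f}, "
          f"Phi = {dependability_index(n):.3f}")
# With these assumed components, E-rho^2 first reaches .80 at 5 raters
# and Phi first reaches .80 at 17 raters.
```

The gap between the two criteria follows from where rater leniency or severity is counted: it inflates absolute error but leaves relative error untouched, which is why far more raters are needed to reach .80 on the dependability index than on the generalizability coefficient.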


2014 · Vol. 101 (1) · pp. 70-76
Author(s):  
Christopher DeLuca
Benjamin Bolden
