Using the Many-Faceted Rasch Model to Evaluate Standard Setting Judgments

2012 · Vol 73 (3) · pp. 386-411
Author(s): Pamela K. Kaliski, Stefanie A. Wind, George Engelhard, Deanna L. Morgan, Barbara S. Plake, ...

2010 · Vol 42 (4) · pp. 944-956
Author(s): Michelangelo Vianello, Egidio Robusto

2008 · Vol 17 (3) · pp. 47-68
Author(s): Jason E. Chapman, Ashli J. Sheidow, Scott W. Henggeler, Colleen A. Halliday-Boykins, Phillippe B. Cunningham

2021
Author(s): Nazdar E. Alkhateeb, Ali Al-Dabbagh, Yaseen Mohammed, Mohammed Ibrahim

Any high-stakes assessment that leads to an important decision requires careful consideration of whether a student passes or fails. Despite the many standard-setting methods implemented in clinical examinations, concerns remain about the reliability of pass/fail decisions in high-stakes assessment, especially clinical assessment. This observational study proposes a defensible pass/fail decision method based on the number of failed competencies. In the study, conducted in Erbil, Iraq, in June 2018, results were obtained for 150 medical students on their final objective structured clinical examination. Cutoff scores and pass/fail decisions were calculated using the modified Angoff, borderline, borderline-regression, and holistic methods. The results were compared with each other, and with a new competency method, using Cohen’s kappa. Rasch analysis was used to assess the consistency of the competency data with Rasch model estimates. The competency method failed 40 students (26.7%), compared with 76 (50.7%), 37 (24.7%), 35 (23.3%), and 13 (8.7%) for the modified Angoff, borderline, borderline-regression, and holistic methods, respectively. The competency method showed adequate fit to the Rasch model (mean outfit and infit statistics of 0.961 and 0.960, respectively). In conclusion, the competency method set a more stringent pass/fail standard than all of the other standard-setting methods except the modified Angoff method. The fit of the competency data to the Rasch model provides evidence for the validity and reliability of its pass/fail decisions.
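
A minimal sketch of the kappa comparison described above, using hypothetical scores, cut values, and variable names (none of these numbers are the study's):

```python
# Compare pass/fail decisions from two standard-setting methods with
# Cohen's kappa. All values below are hypothetical stand-ins.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
scores = rng.normal(60, 10, size=150)            # hypothetical OSCE totals

angoff_cut, borderline_cut = 62.0, 55.0          # hypothetical cut scores
pass_angoff = (scores >= angoff_cut).astype(int)
pass_borderline = (scores >= borderline_cut).astype(int)

kappa = cohen_kappa_score(pass_angoff, pass_borderline)
print(f"Cohen's kappa between methods: {kappa:.2f}")
```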


PLoS ONE · 2021 · Vol 16 (11) · e0257871
Author(s): Tabea Feseker, Timo Gnambs, Cordula Artelt

To draw pertinent conclusions about persons with low reading skills, it is essential to use validated standard-setting procedures by which they can be assigned to their appropriate proficiency level. Because no standard-setting procedure is free of weaknesses, external validity studies are indispensable. Traditionally, studies have assessed validity by comparing different judgement-based standard-setting procedures; only a few have used model-based approaches to validate judgement-based ones. The present study addressed this shortcoming by comparing the agreement of cut score placements between a judgement-based approach (the Bookmark procedure) and a model-based one (a constrained mixture Rasch model). Individuals with low reading proficiency were distinguished from those with a functional level of reading proficiency in three independent samples of the German National Educational Panel Study: ninth-grade students (N = 13,897) and adults (Ns = 5,335 and 3,145). The analyses showed quite similar mean cut scores for the two standard-setting procedures in two of the samples, whereas the third sample showed more pronounced differences. Importantly, these findings demonstrate that model-based approaches provide a valid and resource-efficient alternative for external validation, although they can be sensitive to the ability distribution within a sample.
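
As an illustration of the judgement-based side, the Bookmark procedure places the cut at the ability where the bookmarked item is answered correctly with a chosen response probability (commonly RP = 0.67). A minimal sketch under the Rasch model, with a hypothetical item difficulty:

```python
# Invert the Rasch item characteristic curve P(theta) = 1 / (1 + exp(-(theta - b)))
# at P = rp to obtain the cut ability for a bookmarked item of difficulty b.
# The difficulty value below is hypothetical.
import math

def bookmark_cut(b: float, rp: float = 0.67) -> float:
    return b + math.log(rp / (1.0 - rp))

print(f"theta cut: {bookmark_cut(b=0.5):.2f}")   # 0.5 + ln(0.67/0.33) ~= 1.21
```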


2018 · Vol 122 (2) · pp. 748-772
Author(s): Wen-Ta Tseng, Tzi-Ying Su, John-Michael L. Nix

This study applied the many-facet Rasch model to assess learners’ translation ability in an English as a foreign language context. Few attempts have been made in extant research to detect and calibrate rater severity in the domain of translation testing. To fill this gap, the study documented the process of validating a test of Chinese-to-English sentence translation and modeled raters’ scoring propensity, defined by harshness or leniency, expert/novice effects on severity, and concomitant effects on item difficulty. Two hundred twenty-five third-year Taiwanese senior high school students and six educators from tertiary and secondary institutions served as participants. The students’ mean age was 17.80 years (SD = 1.20, range 17–19). The exam consisted of 10 translation items adapted from two entrance exams. The results showed that this subjectively scored performance assessment exhibited robust unidimensionality, reliably measuring translation ability free from unmodeled disturbances. Discrepancies in ratings between novice and expert raters were also identified and modeled by the many-facet Rasch model. Implications for applying the many-facet Rasch model in translation tests at the tertiary level are discussed.
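
For reference, a standard rating-scale formulation of the many-facet Rasch model (following Linacre) gives the log-odds of examinee n receiving category k rather than k-1 on item i from rater j as:

```latex
\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k
```

where \theta_n is examinee ability, \delta_i is item difficulty, \alpha_j is rater severity, and \tau_k is the threshold for category k. The rater term is what allows harshness or leniency to be calibrated on the same scale as ability and difficulty.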


2017 · Vol 22 (3) · pp. 377-393
Author(s): D. Gregory Springer, Kelly D. Bradley

Prior research indicates mixed findings regarding the consistency of adjudicators’ ratings at large ensemble festivals, yet the results of these festivals strongly affect the perceived success of instrumental music programs and the perceived effectiveness of their directors. In this study, Rasch modeling was used to investigate the potential influence of adjudicators on performance ratings at a live large ensemble festival. Evaluation forms from a junior high school concert band festival adjudicated by a panel of three expert judges were analyzed using the many-facet Rasch model. The analyses revealed several trends. First, the practice of assigning “half points” between adjacent response options on the 5-point rating scale introduced redundancy and measurement noise. Second, adjudicators provided relatively similar ratings for conceptually distinct criteria, which could be evidence of a halo effect. Third, although all judges rated relatively leniently overall, one judge rated more severely than the others. Finally, an exploratory interaction analysis between the judge and band facets indicated the presence of rater-mediated bias. Implications for music researchers and ensemble adjudicators are discussed in the context of ensemble performance evaluations, and a measurement framework that can be applied to other aspects of music performance evaluation is introduced.
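
Rater diagnostics in analyses like this one typically rest on infit and outfit mean-square statistics computed from standardized residuals. A minimal sketch for the dichotomous Rasch case (the polytomous many-facet case follows the same residual logic), with simulated data:

```python
# Infit/outfit mean squares for a dichotomous Rasch model, from simulated data.
# Outfit is the plain mean of squared standardized residuals; infit weights
# each residual by its model variance, so it is less sensitive to outliers.
import numpy as np

def rasch_fit(x, theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    w = p * (1.0 - p)                                         # model variances
    outfit = ((x - p) ** 2 / w).mean(axis=0)                  # unweighted, per item
    infit = ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)        # variance-weighted
    return infit, outfit

rng = np.random.default_rng(7)
theta = rng.normal(size=200)                     # simulated person abilities
b = np.linspace(-1.0, 1.0, 5)                    # simulated item difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((200, 5)) < p).astype(float)     # simulated responses

infit, outfit = rasch_fit(x, theta, b)
print("infit:", np.round(infit, 2), "outfit:", np.round(outfit, 2))
```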


2015 · Vol 43 (2) · pp. 299-316
Author(s): Sonia Ferreira Lopes Toffoli, Dalton Francisco de Andrade, Antonio Cezar Bornia