rater performance
Recently Published Documents


TOTAL DOCUMENTS: 26 (FIVE YEARS: 4)

H-INDEX: 7 (FIVE YEARS: 0)

2021 ◽  
Vol 9 (3) ◽  
pp. 225-241
Author(s):  
Alper Şahin

Numerous student performances are assessed in Intensive English Programs (IEPs) worldwide each academic year. These performances are mostly graded by human raters with a certain degree of error. The accuracy of these performance assessments is of utmost importance because they feed into high-stakes decisions about students and constitute a large share of students' scores. Therefore, IEPs should give the accuracy of these assessments priority. However, the current systems that could help IEP administrators monitor rater performance in performance assessment are far from practical because they require complex mathematical models and specialized software. This paper proposes a practical and easy-to-maintain rater performance categorization system, accompanied by a sample study. Its benefits to IEP administrators and their raters are discussed, along with practical considerations.
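The abstract does not describe the proposed categorization system in detail, so the sketch below is only a minimal illustration of the general idea of flagging raters without specialized software: each rater is categorized as lenient, severe, or on target from the mean deviation of their awarded scores. The data, cutoffs, and category labels are hypothetical.

```python
# Illustrative sketch only (not the paper's system): categorize raters by how far
# their average awarded score deviates from the overall average. Cutoffs are hypothetical.
import statistics

def categorize_raters(scores, lenient_cutoff=5.0, severe_cutoff=-5.0):
    """scores: dict mapping rater name -> list of scores that rater awarded."""
    all_scores = [s for awarded in scores.values() for s in awarded]
    overall_mean = statistics.mean(all_scores)
    categories = {}
    for rater, awarded in scores.items():
        deviation = statistics.mean(awarded) - overall_mean
        if deviation >= lenient_cutoff:
            categories[rater] = "lenient"
        elif deviation <= severe_cutoff:
            categories[rater] = "severe"
        else:
            categories[rater] = "on target"
    return categories

print(categorize_raters({"R1": [85, 90, 88], "R2": [70, 72, 68], "R3": [80, 79, 81]}))
```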


2020 ◽  
Vol 23 (2) ◽  
pp. 73-95
Author(s):  
Peiyu Wang ◽  
Karen Coetzee ◽  
Andrea Strachan ◽  
Sandra Monteiro ◽  
Liying Cheng

Internationally educated nurses’ (IENs) English language proficiency is critical to professional licensure as communication is a key competency for safe practice. The Canadian English Language Benchmark Assessment for Nurses (CELBAN) is Canada’s only Canadian Language Benchmarks (CLB) referenced examination used in the context of healthcare regulation. This high-stakes assessment claims proof of proficiency for IENs seeking licensure in Canada and a measure of public safety for nursing regulators. Because speaking assessment involves rater judgement, understanding the quality of rater performance is crucial to maintaining speaking test quality when examination results are used for high-stakes decisions, and it requires strong reliability evidence (Koizumi et al., 2017). This study examined rater performance on the CELBAN Speaking component using Many-Facets Rasch Measurement (MFRM). Specifically, it identified CELBAN rater reliability in terms of consistency and severity, rating bias, and use of the rating scale. The study was based on a sample of 115 raters across eight test sites in Canada and results from 2,698 examinations across four parallel versions. Findings demonstrated relatively high inter-rater and intra-rater reliability, and that CLB-based speaking descriptors (CLB 6-9) provided sufficient information for raters to discriminate examinees’ oral proficiency. There was no influence of test site or test version, offering validity evidence to support test use for high-stakes purposes. Grammar, among the eight speaking criteria, was identified as the most difficult criterion on the scale and the one demonstrating the most rater bias. The study highlights the value of MFRM analysis in rater performance research, with implications for rater training, and is one of the first studies to use MFRM with a CLB-referenced high-stakes assessment in the Canadian context.
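The abstract does not spell out the model specification, but a standard three-facet rating scale formulation of MFRM, with examinee ability, rater severity, and criterion difficulty, is sketched below; the notation follows common MFRM conventions rather than the authors' own.

```latex
% Standard three-facet MFRM rating scale model (common notation, assumed here):
% P_{nijk}     = probability that examinee n is awarded category k by rater i on criterion j
% P_{nij(k-1)} = probability of the adjacent lower category
\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \alpha_i - \delta_j - \tau_k
% \theta_n : ability of examinee n
% \alpha_i : severity of rater i
% \delta_j : difficulty of criterion j (e.g., grammar)
% \tau_k   : threshold between scale categories k-1 and k
```

Rater severity and bias terms estimated from a model of this form are what underpin the consistency, severity, and bias indices reported in the study.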


Author(s):  
David W. Bracken ◽  
Christopher T. Rotolo

When raters in a 360 Feedback process do not perform as desired, the result can be highly skewed distributions: The data lose their utility, especially when they are to be used for decision-making. We use the ALAMO performance model [Performance = Alignment × (Ability × Motivation × Opportunity)] to dissect the causes and possible solutions for suboptimal rater performance. Using a systems model of 360 Feedback, we analyze three major factors that can determine the quality of 360 data (i.e., Instrument/Content, Process Features, and Rater Characteristics). No two 360 Feedback systems are the same. It follows that no two diagnoses or prescriptions will be the same across the dozens of decisions that must be made in the design and implementation of a given process. Some of those decisions can be guided by science, some by the unique organization and its leaders, and most by a combination of both. We strongly recommend that both groups of stakeholders (scientist practitioners and process owners) study this chapter prior to making those decisions.
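The multiplicative form of the ALAMO model implies that rater performance collapses when any single factor is missing, which is why diagnosis proceeds factor by factor. A minimal sketch of that property, with factors assumed to be scaled 0-1 for illustration only:

```python
# Illustrative only: ALAMO's multiplicative form means a zero on any factor zeroes performance.
# The 0-1 scaling is an assumption for this sketch, not part of the chapter.
def alamo_performance(alignment, ability, motivation, opportunity):
    return alignment * (ability * motivation * opportunity)

print(alamo_performance(0.9, 0.8, 0.7, 1.0))  # 0.504: all factors present
print(alamo_performance(0.9, 0.8, 0.7, 0.0))  # 0.0: no opportunity to observe, ratings lose value
```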


2018 ◽  
Vol 8 (1) ◽  
Author(s):  
Lan-fen Huang ◽  
Simon Kubelec ◽  
Nicole Keng ◽  
Lung-hsun Hsu

2018 ◽  
Vol 47 (8) ◽  
pp. 492-501 ◽  
Author(s):  
Mark C. White

Raters must score accurately and consistently for classroom observation scores to be valid. This requires (a) a standard defining when scoring is accurate and consistent enough and (b) measuring and remediating rater performance against that standard. Current practice has focused on this second problem to the exclusion of the first. My goal here is to start a discussion about identifying a clear, explicit standard that ensures observation scores reflect a consistent view of teaching quality, rather than raters’ idiosyncratic perspectives. In doing so, I connect current certification test cut-scores, the current practice most analogous to a standard, to explicit rater standards, highlighting both the inadequacy of cut-scores and the low standards implicit to current practice.
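For context on what such a standard might operationalize, certification tests typically compare a candidate rater's scores on pre-scored material against master codes and apply a cut-score on agreement; the sketch below is purely illustrative, and its 0.70 exact-agreement cut-score is a hypothetical value, not one recommended by the article.

```python
# Hypothetical sketch of a certification check against master scores.
# The exact-agreement cut-score of 0.70 is illustrative, not a recommended standard.
def certification_result(rater_scores, master_scores, exact_cut=0.70):
    assert len(rater_scores) == len(master_scores)
    n = len(master_scores)
    exact = sum(r == m for r, m in zip(rater_scores, master_scores)) / n
    adjacent = sum(abs(r - m) <= 1 for r, m in zip(rater_scores, master_scores)) / n
    return {"exact": exact, "adjacent": adjacent, "certified": exact >= exact_cut}

# Five observation segments scored on a 1-4 rubric against master codes (hypothetical data).
print(certification_result([2, 3, 3, 4, 2], [2, 3, 4, 4, 3]))
```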


2018 ◽  
Author(s):  
Emily Zhang ◽  
Vivian Leung ◽  
Daniel SJ Pang

Rodent grimace scales facilitate assessment of spontaneous pain and can identify a range of acute pain levels. Reported rater training in using these scales varies considerably and may contribute to observed variability in inter-rater reliability. This study evaluated the effect of training on inter-rater reliability with the Rat Grimace Scale (RGS). Two training sets, of 42 and 150 images, were prepared from several acute pain models. Four trainee raters progressed through two rounds of training, first scoring 42 images (S1) followed by 150 images (S2a). After each round, trainees reviewed the RGS and any problematic images with an experienced rater. The 150 images were then re-scored (S2b). Four years after training, all trainees re-scored the 150 images (S2c). Inter- and intra-rater reliability were evaluated using the intra-class correlation coefficient (ICC), and ICCs were compared with a Feldt test. Inter-rater reliability increased from moderate (0.58 [95% CI: 0.43-0.72]) to very good (0.85 [0.81-0.88]) between S1 and S2b (p < 0.01) and also increased between S2a and S2b (p < 0.01). The action units with the highest and lowest ICCs at S2b were orbital tightening (0.84 [0.80-0.87]) and whiskers (0.63 [0.57-0.70]), respectively. In comparison to an experienced rater, the ICCs for all trainees improved, ranging from 0.88 to 0.91 at S2b. Four years later, very good inter-rater reliability was retained (0.82 [0.76-0.84]) and intra-rater reliability was good or very good (0.78-0.87). Training improves inter-rater reliability between trainees, with an associated narrowing of the 95% CI. Additionally, training resulted in improved inter-rater reliability alongside an experienced rater. Performance was retained after several years. The beneficial effects of training potentially reduce data variability and improve experimental animal welfare.
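The abstract reports ICCs without naming the ICC model; a two-way random-effects, absolute-agreement, single-rater form (ICC(2,1)) is one common choice for an images-by-raters score matrix, and the sketch below computes it for a hypothetical toy data set.

```python
# Sketch of ICC(2,1): two-way random effects, absolute agreement, single rater.
# The abstract does not state which ICC form the authors used; this is one common choice.
import numpy as np

def icc_2_1(scores):
    """scores: 2-D array, rows = images (targets), columns = raters."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)                               # per-image means
    col_means = scores.mean(axis=0)                               # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)          # between-image mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)          # between-rater mean square
    sse = np.sum((scores - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                                # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 5 images scored by 4 raters on the 0-2 RGS action-unit scale (hypothetical data).
print(round(icc_2_1([[0, 0, 1, 0], [2, 2, 2, 1], [1, 1, 1, 1], [0, 1, 0, 0], [2, 2, 1, 2]]), 2))
```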


2017 ◽  
Vol 3 ◽  
pp. 7-19
Author(s):  
Pilvi Alp ◽  
Anu Epner ◽  
Hille Pajupuu

Assessment reliability is vital in language testing. We studied the influence of empathy, age, and experience on the assessment of the writing component of Estonian language proficiency examinations at levels A2–C1, and the effect of these rater properties on rater performance at different proficiency levels. The study included 5,270 examination papers, each assessed by two raters. Raters were aged 34–73 and had 3–15 years of rating experience. The empathy level (EQ) of all 26 A2–C1 raters had previously been measured with Baron-Cohen and Wheelwright’s self-report questionnaire. The results of the correlation analysis indicated that, given regular training (and three or more years of experience), a rater’s level of empathy, age, and experience did not have a significant effect on the scores.
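The abstract does not detail the correlation procedure, so the sketch below only illustrates the kind of check described: correlating rater EQ with a simple severity index. The data, the severity index, and the choice of Spearman correlation are all assumptions made for illustration.

```python
# Illustrative sketch: correlate rater empathy (EQ) with a simple severity index
# (mean signed difference from the co-rater). All numbers below are hypothetical.
from scipy.stats import spearmanr

eq_scores = [32, 41, 55, 47, 60, 38]                  # EQ per rater (hypothetical)
severity_index = [0.4, -0.1, 0.2, 0.0, -0.3, 0.1]     # mean(own score - co-rater score), hypothetical

rho, p_value = spearmanr(eq_scores, severity_index)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```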

