Rater severity
Recently Published Documents

TOTAL DOCUMENTS: 32 (five years: 11)
H-INDEX: 5 (five years: 1)
2021 · pp. 001316442110432
Author(s): Kuan-Yu Jin, Thomas Eckes

Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the validity and fairness of the assessment outcomes. Differential rater functioning (DRF) is a special kind of threat to fairness, manifesting itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and rater centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) could reliably be detected and compensated for under the DDRFM. Using sample data from a large-scale writing assessment (N = 1,323), we demonstrate the practical measurement consequences of dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.
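The abstract does not reproduce the DDRFM's parameterization, so the following is only a minimal LaTeX sketch of a standard many-facet rating scale model extended with an illustrative rater-by-group interaction for severity-type DRF; all symbols (θ, α, τ, γ, ω) are assumed notation, not the authors'.

```latex
% Baseline many-facet rating scale model: examinee n (ability \theta_n),
% rater j (severity \alpha_j), response category k with threshold \tau_k.
\log\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \tau_k

% Illustrative severity-type DRF: \gamma_{j,g(n)} lets rater j's severity shift
% for examinee group g(n). A rater-specific rescaling of the thresholds
% (e.g., \tau_{jk} = \omega_j \tau_k) would analogously capture centrality,
% i.e., over-use of the middle categories.
\log\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \gamma_{j,g(n)} - \tau_k
```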


Author(s): Daniel R. Isbell, Young-A Son

Elicited Imitation Tests (EITs) are commonly used in second language acquisition (SLA)/bilingualism research contexts to assess the general oral proficiency of study participants. While previous studies have provided valuable EIT construct-related validity evidence, some key gaps remain. This study uses an integrative data analysis to further probe the validity of Korean EIT score interpretations by examining the performances of 318 Korean learners (198 second language, 79 foreign language, and 41 heritage) on the Korean EIT, scored by five different raters. Expanding on previous EIT validation efforts, this study (a) examined both inter-rater reliability and differences in rater severity, (b) explored measurement bias across subpopulations of language learners, (c) identified linguistic features that relate to item difficulty, and (d) provided a norm-referenced interpretation for Korean EIT scores. Overall, the findings suggest that the Korean EIT can be used in diverse SLA/bilingualism research contexts, as it measures ability similarly across subgroups and raters.
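As a rough illustration of the two rater-focused checks named in (a), the sketch below works through inter-rater reliability (mean pairwise correlation) and rater severity (mean deviation from examinee means) on a simulated examinee-by-rater score matrix. The scale, sample sizes, and simple estimators are assumptions; the study itself relies on more sophisticated psychometric modeling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 318 examinees scored by 5 raters on an assumed 0-120 EIT scale.
n_examinees, n_raters = 318, 5
ability = rng.normal(60, 15, n_examinees)
severity = rng.normal(0, 3, n_raters)               # per-rater severity offsets
scores = np.clip(ability[:, None] - severity[None, :]
                 + rng.normal(0, 5, (n_examinees, n_raters)), 0, 120)

# Rater severity: how far each rater sits below/above the per-examinee mean
# (positive = more severe, i.e., systematically lower scores).
severity_est = -(scores - scores.mean(axis=1, keepdims=True)).mean(axis=0)
print("estimated severity offsets:", np.round(severity_est, 2))

# Inter-rater reliability (consistency): average pairwise Pearson correlation.
corr = np.corrcoef(scores.T)
pairwise = corr[np.triu_indices(n_raters, k=1)]
print("mean pairwise correlation:", round(pairwise.mean(), 3))
```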


Author(s): Amir Rezaei, Khaled Barkaoui

This study aimed to compare second-language (L2) students’ ratings of their peers’ essays on multiple criteria with those of their teachers under different assessment conditions. Forty EFL teachers and 40 EFL students took part in the study. Each rated one essay on five criteria twice, under high-stakes and low-stakes assessment conditions. Multifaceted Rasch analysis and correlation analyses were conducted to compare rater severity and consistency across rater groups, rating criteria, and assessment conditions. The results revealed more variation in students’ ratings than in teachers’ ratings across assessment conditions. Additionally, the two rater groups differed in severity on different criteria: in general, students were significantly more severe than teachers on language use, whereas teachers were significantly more severe than students on organization. Student and teacher severity also varied across rating criteria and assessment conditions. The findings have implications for planning and implementing peer assessment in the L2 writing classroom as well as for future research.
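A minimal sketch of the group-by-criterion severity comparison described above, under assumed data: two groups of 40 raters each score essays on five criteria, and a Welch t-test compares group means per criterion. The criterion names, scale, and injected effects are placeholders; the study itself used multifaceted Rasch analysis rather than raw-score t-tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
criteria = ["content", "organization", "vocabulary", "language use", "mechanics"]

# Hypothetical ratings on a 1-10 band scale: 40 raters per group, five criteria.
# Placeholder effects: teachers harsher on organization, students on language use.
teacher_means = np.array([6.0, 5.2, 6.0, 6.0, 6.0])
student_means = np.array([6.0, 6.0, 6.0, 5.2, 6.0])
teacher = np.clip(rng.normal(teacher_means, 1.2, (40, 5)), 1, 10)
student = np.clip(rng.normal(student_means, 1.2, (40, 5)), 1, 10)

for i, criterion in enumerate(criteria):
    t, p = stats.ttest_ind(student[:, i], teacher[:, i], equal_var=False)
    print(f"{criterion:13s} students={student[:, i].mean():.2f} "
          f"teachers={teacher[:, i].mean():.2f}  t={t:+.2f}  p={p:.3f}")
```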


Author(s): Masaki Uto

Performance assessments, in which human raters assess examinee performance on practical tasks, have attracted much attention in various assessment contexts involving the measurement of higher-order abilities. A persistent difficulty, however, is that the accuracy of ability measurement depends strongly on rater and task characteristics such as rater severity and task difficulty. To address this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising the results of multiple performance tests administered to different examinees, test linking is needed to unify the scales of the model parameters estimated from the individual test results. In test linking, test administrators generally need to design the tests so that raters and tasks partially overlap, and the accuracy of linking under this design is highly reliant on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high linking accuracy remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common-rater and common-task designs. Concretely, we conduct simulation experiments that examine MFRM-based linking accuracy while varying the numbers of common raters and tasks together with other factors that may affect linking accuracy.
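To make the common-rater linking idea concrete, here is a hedged Python sketch: two simulated administrations share three raters, rater severities are estimated separately on each form (by simple mean deviations, a crude stand-in for full MFRM estimation), and a mean-mean shift computed from the common raters places the second form's estimates on the first form's scale. All sizes, indices, and the additive rating model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n_examinees, severities, noise_sd=0.6):
    """Ratings = examinee ability - rater severity + noise (crude MFRM stand-in)."""
    ability = rng.normal(0, 1, n_examinees)
    noise = rng.normal(0, noise_sd, (n_examinees, len(severities)))
    return ability[:, None] - severities[None, :] + noise

def estimate_severity(scores):
    """Per-rater severity relative to each form's own (arbitrary) origin."""
    return -(scores - scores.mean(axis=1, keepdims=True)).mean(axis=0)

# Form A uses raters 0-7, Form B uses raters 5-12, so raters 5-7 are common.
true_sev = rng.normal(0, 0.5, 13)
sev_a = estimate_severity(simulate(200, true_sev[0:8]))
sev_b = estimate_severity(simulate(150, true_sev[5:13]))

# Mean-mean linking: shift Form B so the common raters (indices 5-7 on A,
# 0-2 on B) have the same average estimated severity on both forms.
shift = sev_a[5:8].mean() - sev_b[0:3].mean()
sev_b_linked = sev_b + shift

print("common raters on Form A :", np.round(sev_a[5:8], 2))
print("common raters, linked B :", np.round(sev_b_linked[0:3], 2))
```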


2020 · pp. 026553222094096
Author(s): Iasonas Lamprianou, Dina Tsagari, Nansia Kyriakou

This longitudinal study (2002–2014) investigates the stability of the rating characteristics of a large group of raters over time in the context of the writing paper of a national high-stakes examination. The study uses one measure of rater severity and two measures of rater consistency. The results suggest that the rating characteristics of individual raters are not stable; predictions from one administration to the next are therefore difficult, although not impossible. Indeed, as the membership of the group of raters changes from year to year, past data on rating characteristics become less useful. When the membership of the group of raters is retained, the community of raters develops more stable characteristics. However, “cultural shocks” (low retention of raters and large numbers of newcomers) destabilize the rating characteristics of the community, and predictions become more difficult. We propose practical measures to increase the stability of rating across time and offer methodological suggestions for more efficient research designs and analyses of rater effects.
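A small illustrative sketch (with assumed data, not the study's) of the kind of longitudinal check described above: estimate each retained rater's severity in two consecutive administrations and correlate the two sets of estimates to see how predictable severity is from one year to the next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_raters, n_scripts = 60, 120

def estimated_severity(true_severity):
    """Estimate rater severity from one administration's script-by-rater ratings."""
    ability = rng.normal(0, 1, n_scripts)
    ratings = (ability[:, None] - true_severity[None, :]
               + rng.normal(0, 0.7, (n_scripts, n_raters)))
    return -(ratings - ratings.mean(axis=1, keepdims=True)).mean(axis=0)

# Assumed drift process: each rater's true severity only partly carries over.
sev_2013 = rng.normal(0, 0.5, n_raters)
sev_2014 = 0.5 * sev_2013 + rng.normal(0, 0.4, n_raters)

r, _ = stats.pearsonr(estimated_severity(sev_2013), estimated_severity(sev_2014))
print(f"year-to-year correlation of estimated severity: r = {r:.2f}")
```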


2020 · Vol 8 (1) · pp. 3-29
Author(s): Pia Sundqvist, Erica Sandlund, Gustaf B. Skar, Michael Tengberg

The main objective of this study was to examine whether a Rater Identity Development (RID) program would increase interrater reliability and improve the calibration of scores against benchmarks in the assessment of second/foreign-language English oral proficiency. Eleven primary school teachers-as-raters participated. A pretest–intervention/RID–posttest design was employed, and the data included 220 assessments of student performances. Two types of rater-reliability analyses were conducted: first, estimates of the intraclass correlation coefficient (two-way random-effects model), to indicate the extent to which raters were consistent in their rankings, and second, a many-facet Rasch measurement analysis, carried out with FACETS®, to explore systematic differences in rater severity/leniency. Results showed improvement in consistency, presumably as a result of training; at the same time, differences in severity became greater. The results suggest that future rater training may draw on central components of RID, such as core concepts in language assessment, individual feedback, and social moderation work.
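For readers unfamiliar with the consistency index used here, the sketch below computes the two-way random-effects intraclass correlations, ICC(2,1) and ICC(2,k), from a fully crossed performance-by-rater matrix using the standard Shrout–Fleiss ANOVA mean squares. The simulated matrix (20 performances, 11 raters, 1-6 scale) is a placeholder, not the study's data.

```python
import numpy as np

def icc_two_way_random(x):
    """ICC(2,1) and ICC(2,k) for an n-performances-by-k-raters matrix x
    (two-way random-effects model, absolute agreement; Shrout & Fleiss, 1979)."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                                # per performance
    col_means = x.mean(axis=0)                                # per rater
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)      # between performances
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)      # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))            # residual
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

# Placeholder data: 20 oral performances rated by 11 raters on a 1-6 scale.
rng = np.random.default_rng(4)
quality = rng.normal(3.5, 0.8, 20)
severity = rng.normal(0.0, 0.3, 11)
ratings = np.clip(np.rint(quality[:, None] - severity[None, :]
                          + rng.normal(0, 0.5, (20, 11))), 1, 6)

single, average = icc_two_way_random(ratings)
print(f"ICC(2,1) = {single:.2f}   ICC(2,k) = {average:.2f}")
```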


2019 · Vol 5 (2) · pp. 294-323
Author(s): Charles Nagle

Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in English. Although AMT and similar platforms are well positioned to enhance the state of the art in L2 research, it is unclear whether crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish speech samples. Fifty-four AMT workers who were native Spanish speakers from 11 countries participated in the ratings. Intraclass correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency ratings, but indices were slightly lower for accentedness, leading to recommendations for improving the task for future data collection.
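As a rough stand-in for the Rasch-based rater diagnostics mentioned above (the study used many-facet Rasch analyses), this sketch fits a simple additive sample-plus-rater decomposition to a simulated crowdsourced rating matrix and reports each rater's severity together with a per-rater residual mean square as an informal fit/consistency flag. The 9-point scale, the numbers of samples and workers, and the handful of deliberately erratic raters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical comprehensibility ratings: 40 speech samples x 54 AMT workers, 1-9 scale.
n_samples, n_raters = 40, 54
quality = rng.normal(5.5, 1.5, n_samples)
severity = rng.normal(0.0, 0.6, n_raters)
noise_sd = np.full(n_raters, 0.8)
noise_sd[:5] = 2.0                                  # five deliberately erratic raters
ratings = np.clip(quality[:, None] - severity[None, :]
                  + rng.normal(0, noise_sd, (n_samples, n_raters)), 1, 9)

# Additive decomposition: rating ~ grand mean + sample effect - rater severity.
grand = ratings.mean()
sample_eff = ratings.mean(axis=1) - grand
severity_est = grand - ratings.mean(axis=0)         # higher = more severe rater
expected = grand + sample_eff[:, None] - severity_est[None, :]

# Per-rater residual mean square: large values flag inconsistent (misfitting) raters.
residual_msq = ((ratings - expected) ** 2).mean(axis=0)
flagged = np.argsort(residual_msq)[-5:]
print("most erratic raters:", flagged)
print("their severity estimates:", np.round(severity_est[flagged], 2))
```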

