Rater severity
Recently Published Documents

TOTAL DOCUMENTS: 32 (five years: 11)
H-INDEX: 5 (five years: 1)
2021 · pp. 001316442110432
Author(s): Kuan-Yu Jin, Thomas Eckes

Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the validity and fairness of the assessment outcomes. Differential rater functioning (DRF) is a special kind of threat to fairness, manifesting itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and rater centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) could reliably be detected and compensated for under the DDRFM. Using sample data from a large-scale writing assessment (N = 1,323), we demonstrate the practical measurement consequences of dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.
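The abstract does not reproduce the DDRFM's parameterization, so the following is only a minimal LaTeX sketch of a standard many-facet rating scale model extended with an illustrative rater-by-group interaction for severity-type DRF; all symbols (θ, α, τ, γ, ω) are assumed notation, not the authors'.

```latex
% Baseline many-facet rating scale model: examinee n (ability \theta_n),
% rater j (severity \alpha_j), response category k with threshold \tau_k.
\log\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \tau_k

% Illustrative severity-type DRF: \gamma_{j,g(n)} lets rater j's severity shift
% for examinee group g(n). A rater-specific rescaling of the thresholds
% (e.g., \tau_{jk} = \omega_j \tau_k) would analogously capture centrality,
% i.e., over-use of the middle categories.
\log\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \gamma_{j,g(n)} - \tau_k
```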


Author(s): Daniel R. Isbell, Young-A Son

Elicited Imitation Tests (EITs) are commonly used in second language acquisition (SLA)/bilingualism research contexts to assess the general oral proficiency of study participants. While previous studies have provided valuable EIT construct-related validity evidence, some key gaps remain. This study uses an integrative data analysis to further probe the validity of Korean EIT score interpretations by examining the performances of 318 Korean learners (198 second language, 79 foreign language, and 41 heritage) on the Korean EIT, scored by five different raters. Expanding on previous EIT validation efforts, this study (a) examined both inter-rater reliability and differences in rater severity, (b) explored measurement bias across subpopulations of language learners, (c) identified linguistic features that relate to item difficulty, and (d) provided a norm-referenced interpretation for Korean EIT scores. Overall, the findings suggest that the Korean EIT can be used in diverse SLA/bilingualism research contexts, as it measures ability similarly across subgroups and raters.
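As a rough illustration of the two rater-focused checks named in (a), the sketch below works through inter-rater reliability (mean pairwise correlation) and rater severity (mean deviation from examinee means) on a simulated examinee-by-rater score matrix. The scale, sample sizes, and simple estimators are assumptions; the study itself relies on more sophisticated psychometric modeling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 318 examinees scored by 5 raters on an assumed 0-120 EIT scale.
n_examinees, n_raters = 318, 5
ability = rng.normal(60, 15, n_examinees)
severity = rng.normal(0, 3, n_raters)               # per-rater severity offsets
scores = np.clip(ability[:, None] - severity[None, :]
                 + rng.normal(0, 5, (n_examinees, n_raters)), 0, 120)

# Rater severity: how far each rater sits below/above the per-examinee mean
# (positive = more severe, i.e., systematically lower scores).
severity_est = -(scores - scores.mean(axis=1, keepdims=True)).mean(axis=0)
print("estimated severity offsets:", np.round(severity_est, 2))

# Inter-rater reliability (consistency): average pairwise Pearson correlation.
corr = np.corrcoef(scores.T)
pairwise = corr[np.triu_indices(n_raters, k=1)]
print("mean pairwise correlation:", round(pairwise.mean(), 3))
```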


Author(s): Amir Rezaei, Khaled Barkaoui

This study aimed to compare second-language (L2) students’ ratings of their peers’ essays on multiple criteria with those of their teachers under different assessment conditions. Forty EFL teachers and 40 EFL students took part in the study. Each rated one essay on five criteria twice, under high-stakes and low-stakes assessment conditions. Multifaceted Rasch analysis and correlation analyses were conducted to compare rater severity and consistency across rater groups, rating criteria, and assessment conditions. The results revealed more variation in students’ ratings than in teachers’ ratings across assessment conditions. Additionally, the two rater groups differed in severity on different criteria: in general, students were significantly more severe than teachers on language use, whereas teachers were significantly more severe than students on organization. Student and teacher severity also varied across rating criteria and assessment conditions. The findings have implications for planning and implementing peer assessment in the L2 writing classroom as well as for future research.
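A minimal sketch of the group-by-criterion severity comparison described above, under assumed data: two groups of 40 raters each score essays on five criteria, and a Welch t-test compares group means per criterion. The criterion names, scale, and injected effects are placeholders; the study itself used multifaceted Rasch analysis rather than raw-score t-tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
criteria = ["content", "organization", "vocabulary", "language use", "mechanics"]

# Hypothetical ratings on a 1-10 band scale: 40 raters per group, five criteria.
# Placeholder effects: teachers harsher on organization, students on language use.
teacher_means = np.array([6.0, 5.2, 6.0, 6.0, 6.0])
student_means = np.array([6.0, 6.0, 6.0, 5.2, 6.0])
teacher = np.clip(rng.normal(teacher_means, 1.2, (40, 5)), 1, 10)
student = np.clip(rng.normal(student_means, 1.2, (40, 5)), 1, 10)

for i, criterion in enumerate(criteria):
    t, p = stats.ttest_ind(student[:, i], teacher[:, i], equal_var=False)
    print(f"{criterion:13s} students={student[:, i].mean():.2f} "
          f"teachers={teacher[:, i].mean():.2f}  t={t:+.2f}  p={p:.3f}")
```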


Author(s): Masaki Uto

Performance assessments, in which human raters assess examinee performance on practical tasks, have attracted much attention in various assessment contexts involving the measurement of higher-order abilities. A persistent difficulty, however, is that the accuracy of ability measurement depends strongly on rater and task characteristics such as rater severity and task difficulty. To address this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising the results of multiple performance tests administered to different examinees, test linking is needed to unify the scales of the model parameters estimated from the individual test results. In test linking, test administrators generally need to design the tests so that raters and tasks partially overlap, and the accuracy of linking under this design is highly reliant on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high linking accuracy remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common-rater and common-task designs. Concretely, we conduct simulation experiments that examine MFRM-based linking accuracy while varying the numbers of common raters and tasks together with other factors that may affect linking accuracy.
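To make the common-rater linking idea concrete, here is a hedged Python sketch: two simulated administrations share three raters, rater severities are estimated separately on each form (by simple mean deviations, a crude stand-in for full MFRM estimation), and a mean-mean shift computed from the common raters places the second form's estimates on the first form's scale. All sizes, indices, and the additive rating model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n_examinees, severities, noise_sd=0.6):
    """Ratings = examinee ability - rater severity + noise (crude MFRM stand-in)."""
    ability = rng.normal(0, 1, n_examinees)
    noise = rng.normal(0, noise_sd, (n_examinees, len(severities)))
    return ability[:, None] - severities[None, :] + noise

def estimate_severity(scores):
    """Per-rater severity relative to each form's own (arbitrary) origin."""
    return -(scores - scores.mean(axis=1, keepdims=True)).mean(axis=0)

# Form A uses raters 0-7, Form B uses raters 5-12, so raters 5-7 are common.
true_sev = rng.normal(0, 0.5, 13)
sev_a = estimate_severity(simulate(200, true_sev[0:8]))
sev_b = estimate_severity(simulate(150, true_sev[5:13]))

# Mean-mean linking: shift Form B so the common raters (indices 5-7 on A,
# 0-2 on B) have the same average estimated severity on both forms.
shift = sev_a[5:8].mean() - sev_b[0:3].mean()
sev_b_linked = sev_b + shift

print("common raters on Form A :", np.round(sev_a[5:8], 2))
print("common raters, linked B :", np.round(sev_b_linked[0:3], 2))
```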


2020 · pp. 026553222094096
Author(s): Iasonas Lamprianou, Dina Tsagari, Nansia Kyriakou

This longitudinal study (2002–2014) investigates the stability of the rating characteristics of a large group of raters over time in the context of the writing paper of a national high-stakes examination. The study uses one measure of rater severity and two measures of rater consistency. The results suggest that the rating characteristics of individual raters are not stable; predictions from one administration to the next are therefore difficult, although not impossible. Indeed, as the membership of the group of raters changes from year to year, past data on rating characteristics become less useful. When the membership of the group of raters is retained, the community of raters develops more stable characteristics. However, “cultural shocks” (low retention of raters and large numbers of newcomers) destabilize the rating characteristics of the community, and predictions become more difficult. We propose practical measures to increase the stability of rating across time and offer methodological suggestions for more efficient research designs and analyses of rater effects.
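A small illustrative sketch (with assumed data, not the study's) of the kind of longitudinal check described above: estimate each retained rater's severity in two consecutive administrations and correlate the two sets of estimates to see how predictable severity is from one year to the next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_raters, n_scripts = 60, 120

def estimated_severity(true_severity):
    """Estimate rater severity from one administration's script-by-rater ratings."""
    ability = rng.normal(0, 1, n_scripts)
    ratings = (ability[:, None] - true_severity[None, :]
               + rng.normal(0, 0.7, (n_scripts, n_raters)))
    return -(ratings - ratings.mean(axis=1, keepdims=True)).mean(axis=0)

# Assumed drift process: each rater's true severity only partly carries over.
sev_2013 = rng.normal(0, 0.5, n_raters)
sev_2014 = 0.5 * sev_2013 + rng.normal(0, 0.4, n_raters)

r, _ = stats.pearsonr(estimated_severity(sev_2013), estimated_severity(sev_2014))
print(f"year-to-year correlation of estimated severity: r = {r:.2f}")
```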


2020 · Vol 8 (1) · pp. 3-29
Author(s): Pia Sundqvist, Erica Sandlund, Gustaf B. Skar, Michael Tengberg

The main objective of this study was to examine whether a Rater Identity Development (RID) program would increase interrater reliability and improve the calibration of scores against benchmarks in the assessment of second/foreign-language English oral proficiency. Eleven primary school teachers-as-raters participated. A pretest–intervention/RID–posttest design was employed, and the data included 220 assessments of student performances. Two types of rater-reliability analyses were conducted: first, estimates of the intraclass correlation coefficient (two-way random-effects model), to indicate the extent to which raters were consistent in their rankings, and second, a many-facet Rasch measurement analysis, carried out with FACETS®, to explore systematic differences in rater severity/leniency. Results showed improvement in consistency, presumably as a result of training; at the same time, differences in severity became greater. The results suggest that future rater training may draw on central components of RID, such as core concepts in language assessment, individual feedback, and social moderation work.
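For readers unfamiliar with the consistency index used here, the sketch below computes the two-way random-effects intraclass correlations, ICC(2,1) and ICC(2,k), from a fully crossed performance-by-rater matrix using the standard Shrout–Fleiss ANOVA mean squares. The simulated matrix (20 performances, 11 raters, 1-6 scale) is a placeholder, not the study's data.

```python
import numpy as np

def icc_two_way_random(x):
    """ICC(2,1) and ICC(2,k) for an n-performances-by-k-raters matrix x
    (two-way random-effects model, absolute agreement; Shrout & Fleiss, 1979)."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                                # per performance
    col_means = x.mean(axis=0)                                # per rater
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)      # between performances
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)      # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))            # residual
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

# Placeholder data: 20 oral performances rated by 11 raters on a 1-6 scale.
rng = np.random.default_rng(4)
quality = rng.normal(3.5, 0.8, 20)
severity = rng.normal(0.0, 0.3, 11)
ratings = np.clip(np.rint(quality[:, None] - severity[None, :]
                          + rng.normal(0, 0.5, (20, 11))), 1, 6)

single, average = icc_two_way_random(ratings)
print(f"ICC(2,1) = {single:.2f}   ICC(2,k) = {average:.2f}")
```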


2019 · Vol 5 (2) · pp. 294-323
Author(s): Charles Nagle

Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in English. Although AMT and similar platforms are well positioned to enhance the state of the art in L2 research, it is unclear whether crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish speech samples. Fifty-four AMT workers who were native Spanish speakers from 11 countries participated in the ratings. Intraclass correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency ratings, but indices were slightly lower for accentedness, leading to recommendations for improving the task for future data collection.
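As a rough stand-in for the Rasch-based rater diagnostics mentioned above (the study used many-facet Rasch analyses), this sketch fits a simple additive sample-plus-rater decomposition to a simulated crowdsourced rating matrix and reports each rater's severity together with a per-rater residual mean square as an informal fit/consistency flag. The 9-point scale, the numbers of samples and workers, and the handful of deliberately erratic raters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical comprehensibility ratings: 40 speech samples x 54 AMT workers, 1-9 scale.
n_samples, n_raters = 40, 54
quality = rng.normal(5.5, 1.5, n_samples)
severity = rng.normal(0.0, 0.6, n_raters)
noise_sd = np.full(n_raters, 0.8)
noise_sd[:5] = 2.0                                  # five deliberately erratic raters
ratings = np.clip(quality[:, None] - severity[None, :]
                  + rng.normal(0, noise_sd, (n_samples, n_raters)), 1, 9)

# Additive decomposition: rating ~ grand mean + sample effect - rater severity.
grand = ratings.mean()
sample_eff = ratings.mean(axis=1) - grand
severity_est = grand - ratings.mean(axis=0)         # higher = more severe rater
expected = grand + sample_eff[:, None] - severity_est[None, :]

# Per-rater residual mean square: large values flag inconsistent (misfitting) raters.
residual_msq = ((ratings - expected) ** 2).mean(axis=0)
flagged = np.argsort(residual_msq)[-5:]
print("most erratic raters:", flagged)
print("their severity estimates:", np.round(severity_est[flagged], 2))
```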

