Using Differential Item Functioning to Test for Interrater Reliability in Constructed Response Items

2020 ◽  
Vol 80 (4) ◽  
pp. 808-820
Author(s):  
Cindy M. Walker ◽  
Sakine Göçer Şahin

The purpose of this study was to investigate a new way of evaluating interrater reliability that can allow one to determine whether two raters differ with respect to their ratings on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three procedures that can be used as measures of interrater reliability were compared: (1) the intraclass correlation coefficient (ICC), (2) Cohen's kappa statistic, and (3) the DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative for assessing the interrater reliability of constructed response items, or other polytomous item types, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine whether a rater is more severe or more lenient in scoring each individual polytomous item on a test or rating scale.
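As context for the comparison above, the two traditional measures the study benchmarks against, the intraclass correlation coefficient and Cohen's kappa, can be computed directly when the design is fully crossed (both raters score every response). The sketch below is illustrative only: the scores, the ICC(2,1) variant, and the use of a quadratically weighted kappa are assumptions made for demonstration, not the study's data or exact procedure.

```python
# Minimal sketch: ICC(2,1) and (weighted) Cohen's kappa for two raters
# scoring the same polytomous responses. Scores are made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores (0-4) from two raters on eight constructed responses.
rater_a = np.array([3, 2, 4, 1, 0, 3, 2, 4])
rater_b = np.array([3, 1, 4, 2, 0, 3, 3, 4])

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, n_raters) array.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1)
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

ratings = np.column_stack([rater_a, rater_b])
print("ICC(2,1):", round(icc_2_1(ratings), 3))
# Quadratic weighting penalises large disagreements more than small ones,
# which suits ordered (polytomous) rating categories.
print("Weighted kappa:",
      round(cohen_kappa_score(rater_a, rater_b, weights="quadratic"), 3))
```

Note that both measures summarize agreement across the whole set of responses; the DIF approach described above instead flags, item by item, whether one rater is systematically more severe or lenient.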

2018 ◽  
Vol 10 (3) ◽  
pp. 269-275 ◽  
Author(s):  
Shalini T. Reddy ◽  
Ara Tekian ◽  
Steven J. Durning ◽  
Shanu Gupta ◽  
Justin Endo ◽  
...  

ABSTRACT Background  Minimally anchored Standard Rating Scales (SRSs), which are widely used in medical education, are hampered by suboptimal interrater reliability. Expert-derived frameworks, such as the Accreditation Council for Graduate Medical Education (ACGME) Milestones, may be helpful in defining level-specific anchors for rating scales. Objective  We examined validity evidence for a Milestones-Based Rating Scale (MBRS) for scoring chart-stimulated recall (CSR). Methods  Two 11-item scoring forms, one with an MBRS and one with an SRS, were developed. Items and anchors for the MBRS were adapted from the ACGME Internal Medicine Milestones. Six standardized CSR videos were developed. Clinical faculty scored the videos using either the MBRS or the SRS, following a randomized crossover design. Reliability of the MBRS versus the SRS was compared using the intraclass correlation coefficient. Results  Twenty-two faculty were recruited for instrument testing. Some participants did not complete scoring, leaving 15 faculty (7 in the MBRS group and 8 in the SRS group). A total of 529 ratings (number of items × number of scores) using SRSs and 540 using MBRSs were available. Percent agreement was higher for MBRSs for only 2 of 11 items, use of consultants (92 versus 75, P = .019) and unique characteristics of patients (96 versus 79, P = .011), and for the overall score (89 versus 82, P < .001). Interrater agreement was 0.61 for MBRSs and 0.51 for SRSs. Conclusions  Adding milestones to our rating form resulted in a statistically significant, but not substantial, improvement in the intraclass correlation coefficient, and the improvement was inconsistent across items.


1995 ◽  
Vol 11 (1) ◽  
pp. 14-20 ◽  
Author(s):  
Sean M. Hammond

This paper presents an item response theory (IRT) analysis of the Beck Depression Inventory (BDI), carried out to assess the assumption of an underlying latent trait common to non-clinical and patient samples. A one-parameter rating scale model was fitted to data drawn from a patient sample and a non-patient sample. Findings suggest that while the BDI fits the model reasonably well for the two samples separately, there is sufficient differential item functioning to raise serious doubts about the viability of using it in the same way with patient and non-patient groups.


2017 ◽  
Vol 5 (1) ◽  
pp. 59-68 ◽  
Author(s):  
Pauli Olavi Rintala ◽  
Arja Kaarina Sääkslahti ◽  
Susanna Iivonen

This study examined the intrarater and interrater reliability of the Test of Gross Motor Development—3rd Edition (TGMD-3). Participants were 60 Finnish children aged between 3 and 9 years, divided into three separate samples of 20. Two samples were used to examine the intrarater reliability of two different assessors, and the third sample was used to establish interrater reliability. Children's TGMD-3 performances were video-recorded and later scored, and reliability was evaluated using an intraclass correlation coefficient, a kappa statistic, and percent agreement. Intrarater reliability for the locomotor subtest, ball skills subtest, and gross motor total score ranged from 0.69 to 0.77, with percent agreement ranging from 87% to 91%. Interrater reliability for the locomotor subtest, ball skills subtest, and gross motor total score ranged from 0.56 to 0.64, with a percent agreement of 83% for locomotor skills, ball skills, and the total score. The hop, horizontal jump, and two-hand strike assessments showed the greatest differences between assessors. These results show acceptable reliability for the TGMD-3 for analyzing children's gross motor skills.
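For illustration, percent agreement of the kind reported above is simply the proportion of identical item-level codes assigned by the two assessors. The sketch below uses hypothetical pass/fail performance-criterion codes, not the study's data.

```python
# Minimal sketch: exact percent agreement between two assessors on
# item-level criterion codes (1 = criterion met, 0 = not met).
import numpy as np

assessor_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
assessor_2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

percent_agreement = 100 * np.mean(assessor_1 == assessor_2)
print(f"Percent agreement: {percent_agreement:.0f}%")  # 80% for these codes
```

Because percent agreement ignores chance agreement, it is typically reported alongside chance-corrected statistics such as kappa or the ICC, as in the abstract above.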


Author(s):  
Linye Jing ◽  
Maria I. Grigos

Purpose: Forming accurate and consistent speech judgments can be challenging when working with children with speech sound disorders who produce a large number and varied types of error patterns. Rating scales offer a systematic approach to assessing the whole word rather than individual sounds, so they can be an efficient way for speech-language pathologists (SLPs) to monitor treatment progress. This study evaluated the interrater reliability of an existing 3-point rating scale using a large group of SLPs as raters. Method: Using an online platform, 30 SLPs completed a brief training and then rated single words produced by children with typical speech patterns and children with speech sound disorders. Words were closely balanced across the three rating categories of the scale. The interrater reliability of the SLPs' ratings against a consensus judgment was examined. Results: The majority of SLPs (87%) reached substantial interrater reliability against the consensus judgment using the 3-point rating scale. Correct productions had the highest interrater reliability. Productions with extensive errors had higher agreement than those with minor errors. Certain error types, such as vowel distortions, were especially challenging for SLPs to judge. Conclusions: This study demonstrated substantial interrater reliability against a consensus judgment among a large majority of 30 SLPs using a 3-point rating scale. The clinical implications of the findings are discussed, along with proposed modifications to the training procedure to guide future research.


2019 ◽  
Vol 5 (1) ◽  
pp. e000541 ◽  
Author(s):  
John Ressman ◽  
Wilhelmus Johannes Andreas Grooten ◽  
Eva Rasmussen Barr

The single leg squat (SLS) is a tool commonly used in clinical examination to set and evaluate rehabilitation goals, and also to assess lower extremity function in active people. Objectives  To conduct a review and meta-analysis of the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests. Design  Review with meta-analysis. Data sources  CINAHL, Cochrane Library, Embase, Medline (OVID), and Web of Science were searched up until December 2018. Eligibility criteria  Studies were eligible for inclusion if they were methodological studies that assessed the inter-rater and/or intrarater reliability of the SLS, FSD, or LSD through observation of movement quality. Results  Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most studies reached ‘moderate’ measures of agreement. The pooled ICC/kappa results showed ‘moderate’ agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and ‘substantial’ agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability with ≤3-point rating scales, while no difference was found for different numbers of segmental assessments. Conclusion  Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution. PROSPERO registration number  CRD42018077822.
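The abstract reports pooled ICC/kappa estimates with 95% confidence intervals but does not describe the pooling method here. The sketch below shows one common way to pool correlation-type reliability coefficients across studies, fixed-effect inverse-variance weighting on the Fisher z scale; the study values and sample sizes are made up, and this is not presented as the review's actual procedure.

```python
# Illustrative sketch: pooling correlation-type reliability coefficients
# (ICC/kappa) across studies via Fisher's z transform with fixed-effect,
# inverse-variance weighting. All inputs are hypothetical.
import numpy as np
from scipy import stats

# (coefficient, number of rated subjects) for a handful of hypothetical studies
studies = [(0.45, 30), (0.62, 25), (0.58, 40), (0.70, 22), (0.51, 35)]

z = np.array([np.arctanh(r) for r, _ in studies])        # Fisher z transform
w = np.array([n - 3 for _, n in studies], dtype=float)   # approx. inverse variances

z_pooled = np.sum(w * z) / np.sum(w)
se_pooled = 1.0 / np.sqrt(np.sum(w))
z_crit = stats.norm.ppf(0.975)                           # two-sided 95% CI

pooled = np.tanh(z_pooled)                               # back-transform
ci_low = np.tanh(z_pooled - z_crit * se_pooled)
ci_high = np.tanh(z_pooled + z_crit * se_pooled)
print(f"Pooled coefficient: {pooled:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A random-effects model would widen the interval when between-study heterogeneity is large, which is likely given the wide range of coefficients (0.00 to 1.00) reported across the included studies.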


2013 ◽  
Vol 38 (1) ◽  
pp. 31-37 ◽  
Author(s):  
AK Kolb ◽  
K Schmied ◽  
P Faßheber ◽  
R Heinrich-Weltzien

Objective: The aim of this video-based study was to examine the taste acceptance of children between the ages of 2 and 5 years regarding highly concentrated fluoride preparations used in kindergarten-based preventive programs. Study design: The fluoride preparation Duraphat was applied to 16 children, Elmex fluid to 15 children, and Fluoridin N5 to 14 children. The procedure was conducted according to a standardized protocol and videotaped. Three raters evaluated the children's nonverbal behavior, as a measure of taste acceptance, on the Frankl Behavior Rating Scale; the interrater reliability (intraclass correlation coefficient; ICC) was .86. In an interview, children rated the taste of the fluoride preparations on a three-point “smiley” rating scale. The interviewer used a hand puppet during the interview to build rapport between the children and the examiners. Results: Children's nonverbal behavior was significantly more positive after the application of Fluoridin N5 and Duraphat than after the application of Elmex fluid. The same trend was found in the smiley ratings. The responses of children who displayed cooperative, positive behavior before the application of the fluoride preparations were significantly more positive than those of children who displayed uncooperative, negative behavior. Conclusion: To achieve high acceptance of fluoride preparations among preschool children, preparations with a pleasant taste should be used.


1969 ◽  
Vol 25 (2) ◽  
pp. 399-406
Author(s):  
S. Thomas Friedman ◽  
Richard F. Purnell ◽  
Edward E. Gotts

The purpose was to use adult participant observers to create a scale for assessing salient personality variables of children and young adolescents living together in close quarters. The 91 children were summer campers of both sexes, aged 8 to 15 years. The campers' counselors served as the adult participant observers. At least two counselors rated each camper on a 49-item rating scale. Interrater reliability was determined, and composite ratings of the campers were factor analyzed. Seven factors accounted for the behaviors on the rating scales. These factors were consistent with and comparable to the constructs built into the rating-scale items, e.g., Peer Orientation, Ego Strength, Interaction Potential, Adult Orientation, Rebelliousness, and Rigidity.


2008 ◽  
Vol 192 (1) ◽  
pp. 52-58 ◽  
Author(s):  
Janet B. W. Williams ◽  
Kenneth A. Kobak

Background  The Montgomery-Åsberg Depression Rating Scale (MADRS) is often used in clinical trials to select patients and to assess treatment efficacy. The scale was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the items. Structured and semi-structured interview guides have been found to improve reliability with other scales. Aims  To describe the development and test-retest reliability of a structured interview guide for the MADRS (SIGMA). Method  A total of 162 test-retest interviews were conducted by 81 rater pairs. Each patient was interviewed twice, once by each rater conducting an independent interview. Results  The intraclass correlation for total score between raters using the SIGMA was r = 0.93, P < 0.0001. All ten items had good to excellent interrater reliability. Conclusions  Use of the SIGMA can result in high reliability of MADRS scores in evaluating patients with depression.

