Using Differential Item Functioning to Test for Interrater Reliability in Constructed Response Items

2020 ◽  
Vol 80 (4) ◽  
pp. 808-820
Author(s):  
Cindy M. Walker ◽  
Sakine Göçer Şahin

The purpose of this study was to investigate a new way of evaluating interrater reliability that can allow one to determine whether two raters differ with respect to their ratings on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three procedures that can be used as measures of interrater reliability were compared: (1) the intraclass correlation coefficient (ICC), (2) Cohen's kappa statistic, and (3) the DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative for assessing the interrater reliability of constructed response items, or other polytomous item types, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine whether a rater is more severe or more lenient in scoring each individual polytomous item on a test or rating scale.
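As context for the comparison above, the two traditional measures the study benchmarks against, the intraclass correlation coefficient and Cohen's kappa, can be computed directly when the design is fully crossed (both raters score every response). The sketch below is illustrative only: the scores, the ICC(2,1) variant, and the use of a quadratically weighted kappa are assumptions made for demonstration, not the study's data or exact procedure.

```python
# Minimal sketch: ICC(2,1) and (weighted) Cohen's kappa for two raters
# scoring the same polytomous responses. Scores are made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores (0-4) from two raters on eight constructed responses.
rater_a = np.array([3, 2, 4, 1, 0, 3, 2, 4])
rater_b = np.array([3, 1, 4, 2, 0, 3, 3, 4])

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, n_raters) array.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1)
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

ratings = np.column_stack([rater_a, rater_b])
print("ICC(2,1):", round(icc_2_1(ratings), 3))
# Quadratic weighting penalises large disagreements more than small ones,
# which suits ordered (polytomous) rating categories.
print("Weighted kappa:",
      round(cohen_kappa_score(rater_a, rater_b, weights="quadratic"), 3))
```

Note that both measures summarize agreement across the whole set of responses; the DIF approach described above instead flags, item by item, whether one rater is systematically more severe or lenient.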

2018 ◽  
Vol 10 (3) ◽  
pp. 269-275 ◽  
Author(s):  
Shalini T. Reddy ◽  
Ara Tekian ◽  
Steven J. Durning ◽  
Shanu Gupta ◽  
Justin Endo ◽  
...  

ABSTRACT Background  Minimally anchored Standard Rating Scales (SRSs), which are widely used in medical education, are hampered by suboptimal interrater reliability. Expert-derived frameworks, such as the Accreditation Council for Graduate Medical Education (ACGME) Milestones, may be helpful in defining level-specific anchors for rating scales. Objective  We examined validity evidence for a Milestones-Based Rating Scale (MBRS) for scoring chart-stimulated recall (CSR). Methods  Two 11-item scoring forms, one with an MBRS and one with an SRS, were developed. Items and anchors for the MBRS were adapted from the ACGME Internal Medicine Milestones. Six standardized CSR videos were developed. Clinical faculty scored the videos using either the MBRS or the SRS, following a randomized crossover design. Reliability of the MBRS versus the SRS was compared using the intraclass correlation coefficient. Results  Twenty-two faculty were recruited for instrument testing. Some participants did not complete scoring, leaving 15 faculty (7 in the MBRS group and 8 in the SRS group). A total of 529 ratings (number of items × number of scores) using SRSs and 540 using MBRSs were available. Percent agreement was higher for MBRSs for only 2 of 11 items, use of consultants (92 versus 75, P = .019) and unique characteristics of patients (96 versus 79, P = .011), and for the overall score (89 versus 82, P < .001). Interrater agreement was 0.61 for MBRSs and 0.51 for SRSs. Conclusions  Adding milestones to our rating form resulted in a statistically significant, but not substantial, improvement in the intraclass correlation coefficient, and the improvement was inconsistent across items.


1995 ◽  
Vol 11 (1) ◽  
pp. 14-20 ◽  
Author(s):  
Sean M. Hammond

This paper presents an item response theory (IRT) analysis of the Beck Depression Inventory (BDI), carried out to assess the assumption of an underlying latent trait common to non-clinical and patient samples. A one-parameter rating scale model was fitted to data drawn from a patient sample and a non-patient sample. Findings suggest that while the BDI fits the model reasonably well for the two samples separately, there is sufficient differential item functioning to raise serious doubts about the viability of using it in the same way with patient and non-patient groups.


2017 ◽  
Vol 5 (1) ◽  
pp. 59-68 ◽  
Author(s):  
Pauli Olavi Rintala ◽  
Arja Kaarina Sääkslahti ◽  
Susanna Iivonen

This study examined the intrarater and interrater reliability of the Test of Gross Motor Development—3rd Edition (TGMD-3). Participants were 60 Finnish children aged between 3 and 9 years, divided into three separate samples of 20. Two samples were used to examine the intrarater reliability of two different assessors, and the third sample was used to establish interrater reliability. Children's TGMD-3 performances were video-recorded and later scored, and reliability was evaluated using an intraclass correlation coefficient, a kappa statistic, and percent agreement. Intrarater reliability for the locomotor subtest, ball skills subtest, and gross motor total score ranged from 0.69 to 0.77, with percent agreement ranging from 87% to 91%. Interrater reliability for the locomotor subtest, ball skills subtest, and gross motor total score ranged from 0.56 to 0.64, with a percent agreement of 83% for locomotor skills, ball skills, and the total score. The hop, horizontal jump, and two-hand strike assessments showed the greatest differences between assessors. These results show acceptable reliability for the TGMD-3 for analyzing children's gross motor skills.
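For illustration, percent agreement of the kind reported above is simply the proportion of identical item-level codes assigned by the two assessors. The sketch below uses hypothetical pass/fail performance-criterion codes, not the study's data.

```python
# Minimal sketch: exact percent agreement between two assessors on
# item-level criterion codes (1 = criterion met, 0 = not met).
import numpy as np

assessor_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
assessor_2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

percent_agreement = 100 * np.mean(assessor_1 == assessor_2)
print(f"Percent agreement: {percent_agreement:.0f}%")  # 80% for these codes
```

Because percent agreement ignores chance agreement, it is typically reported alongside chance-corrected statistics such as kappa or the ICC, as in the abstract above.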


Author(s):  
Linye Jing ◽  
Maria I. Grigos

Purpose: Forming accurate and consistent speech judgments can be challenging when working with children with speech sound disorders who produce a large number and varied types of error patterns. Rating scales offer a systematic approach to assessing the whole word rather than individual sounds, so they can be an efficient way for speech-language pathologists (SLPs) to monitor treatment progress. This study evaluated the interrater reliability of an existing 3-point rating scale using a large group of SLPs as raters. Method: Using an online platform, 30 SLPs completed a brief training and then rated single words produced by children with typical speech patterns and children with speech sound disorders. Words were closely balanced across the three rating categories of the scale. The interrater reliability of the SLPs' ratings against a consensus judgment was examined. Results: The majority of SLPs (87%) reached substantial interrater reliability against the consensus judgment using the 3-point rating scale. Correct productions had the highest interrater reliability. Productions with extensive errors had higher agreement than those with minor errors. Certain error types, such as vowel distortions, were especially challenging for SLPs to judge. Conclusions: This study demonstrated substantial interrater reliability against a consensus judgment among a large majority of 30 SLPs using a 3-point rating scale. The clinical implications of the findings are discussed, along with proposed modifications to the training procedure to guide future research.


2019 ◽  
Vol 5 (1) ◽  
pp. e000541 ◽  
Author(s):  
John Ressman ◽  
Wilhelmus Johannes Andreas Grooten ◽  
Eva Rasmussen Barr

The single leg squat (SLS) is a tool commonly used in clinical examination to set and evaluate rehabilitation goals, and also to assess lower extremity function in active people. Objectives  To conduct a review and meta-analysis of the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests. Design  Review with meta-analysis. Data sources  CINAHL, Cochrane Library, Embase, Medline (OVID), and Web of Science were searched up until December 2018. Eligibility criteria  Studies were eligible for inclusion if they were methodological studies that assessed the inter-rater and/or intrarater reliability of the SLS, FSD, or LSD through observation of movement quality. Results  Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most studies reached ‘moderate’ measures of agreement. The pooled ICC/kappa results showed ‘moderate’ agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and ‘substantial’ agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability with ≤3-point rating scales, while no difference was found for different numbers of segmental assessments. Conclusion  Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution. PROSPERO registration number  CRD42018077822.
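The abstract reports pooled ICC/kappa estimates with 95% confidence intervals but does not describe the pooling method here. The sketch below shows one common way to pool correlation-type reliability coefficients across studies, fixed-effect inverse-variance weighting on the Fisher z scale; the study values and sample sizes are made up, and this is not presented as the review's actual procedure.

```python
# Illustrative sketch: pooling correlation-type reliability coefficients
# (ICC/kappa) across studies via Fisher's z transform with fixed-effect,
# inverse-variance weighting. All inputs are hypothetical.
import numpy as np
from scipy import stats

# (coefficient, number of rated subjects) for a handful of hypothetical studies
studies = [(0.45, 30), (0.62, 25), (0.58, 40), (0.70, 22), (0.51, 35)]

z = np.array([np.arctanh(r) for r, _ in studies])        # Fisher z transform
w = np.array([n - 3 for _, n in studies], dtype=float)   # approx. inverse variances

z_pooled = np.sum(w * z) / np.sum(w)
se_pooled = 1.0 / np.sqrt(np.sum(w))
z_crit = stats.norm.ppf(0.975)                           # two-sided 95% CI

pooled = np.tanh(z_pooled)                               # back-transform
ci_low = np.tanh(z_pooled - z_crit * se_pooled)
ci_high = np.tanh(z_pooled + z_crit * se_pooled)
print(f"Pooled coefficient: {pooled:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A random-effects model would widen the interval when between-study heterogeneity is large, which is likely given the wide range of coefficients (0.00 to 1.00) reported across the included studies.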


2013 ◽  
Vol 38 (1) ◽  
pp. 31-37 ◽  
Author(s):  
AK Kolb ◽  
K Schmied ◽  
P Faßheber ◽  
R Heinrich-Weltzien

Objective: The aim of this video-based study was to examine the taste acceptance of children between the ages of 2 and 5 years regarding highly concentrated fluoride preparations used in kindergarten-based preventive programs. Study design: The fluoride preparation Duraphat was applied to 16 children, Elmex fluid to 15 children, and Fluoridin N5 to 14 children. The procedure was conducted according to a standardized protocol and videotaped. Three raters evaluated the children's nonverbal behavior, as a measure of taste acceptance, on the Frankl Behavior Rating Scale; the interrater reliability (intraclass correlation coefficient; ICC) was .86. In an interview, children rated the taste of the fluoride preparations on a three-point “smiley” rating scale. The interviewer used a hand puppet during the interview to build rapport between the children and the examiners. Results: Children's nonverbal behavior was significantly more positive after the application of Fluoridin N5 and Duraphat than after the application of Elmex fluid. The same trend was found in the smiley ratings. The responses of children who displayed cooperative, positive behavior before the application of the fluoride preparations were significantly more positive than those of children who displayed uncooperative, negative behavior. Conclusion: To achieve high acceptance of fluoride preparations among preschool children, preparations with a pleasant taste should be used.


1969 ◽  
Vol 25 (2) ◽  
pp. 399-406
Author(s):  
S. Thomas Friedman ◽  
Richard F. Purnell ◽  
Edward E. Gotts

The purpose was to use adult participant observers to create a scale for assessing salient personality variables of children and young adolescents living together in close quarters. The 91 children were summer campers of both sexes, aged 8 to 15 years. The campers' counselors served as the adult participant observers. At least two counselors rated each camper on a 49-item rating scale. Interrater reliability was determined, and composite ratings of the campers were factor analyzed. Seven factors accounted for the behaviors on the rating scales. These factors were consistent with and comparable to the constructs built into the rating-scale items, e.g., Peer Orientation, Ego Strength, Interaction Potential, Adult Orientation, Rebelliousness, and Rigidity.


2008 ◽  
Vol 192 (1) ◽  
pp. 52-58 ◽  
Author(s):  
Janet B. W. Williams ◽  
Kenneth A. Kobak

Background  The Montgomery-Åsberg Depression Rating Scale (MADRS) is often used in clinical trials to select patients and to assess treatment efficacy. The scale was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the items. Structured and semi-structured interview guides have been found to improve reliability with other scales. Aims  To describe the development and test-retest reliability of a structured interview guide for the MADRS (SIGMA). Method  A total of 162 test-retest interviews were conducted by 81 rater pairs. Each patient was interviewed twice, once by each rater conducting an independent interview. Results  The intraclass correlation for total score between raters using the SIGMA was r = 0.93, P < 0.0001. All ten items had good to excellent interrater reliability. Conclusions  Use of the SIGMA can result in high reliability of MADRS scores in evaluating patients with depression.

