Preliminary Validity Evidence for a Milestones-Based Rating Scale for Chart-Stimulated Recall

2018 ◽  
Vol 10 (3) ◽  
pp. 269-275 ◽  
Author(s):  
Shalini T. Reddy ◽  
Ara Tekian ◽  
Steven J. Durning ◽  
Shanu Gupta ◽  
Justin Endo ◽  
...  

ABSTRACT Background: Minimally anchored Standard Rating Scales (SRSs), which are widely used in medical education, are hampered by suboptimal interrater reliability. Expert-derived frameworks, such as the Accreditation Council for Graduate Medical Education (ACGME) Milestones, may help define level-specific anchors for rating scales. Objective: We examined validity evidence for a Milestones-Based Rating Scale (MBRS) for scoring chart-stimulated recall (CSR). Methods: Two 11-item scoring forms, one with an MBRS and one with an SRS, were developed. Items and anchors for the MBRS were adapted from the ACGME Internal Medicine Milestones. Six standardized CSR videos were developed. Clinical faculty scored the videos using either the MBRS or the SRS, following a randomized crossover design. Reliability of the MBRS versus the SRS was compared using intraclass correlation. Results: Twenty-two faculty were recruited for instrument testing. Some participants did not complete scoring, leaving a response rate of 15 faculty (7 in the MBRS group and 8 in the SRS group). A total of 529 ratings (number of items × number of scores) using SRSs and 540 using MBRSs were available. Percent agreement was higher for MBRSs for only 2 of 11 items: use of consultants (92 versus 75, P = .019) and unique characteristics of patients (96 versus 79, P = .011); it was also higher for the overall score (89 versus 82, P < .001). Interrater agreement was 0.61 for MBRSs and 0.51 for SRSs. Conclusions: Adding milestones to our rating form produced a significant, but not substantial, improvement in the intraclass correlation coefficient, and the improvement was inconsistent across items.
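The item-level percent agreement reported above can be computed as the mean pairwise agreement across raters. A minimal sketch, assuming a fully crossed raters × videos score matrix per item (the function name and array layout are illustrative, not the study's actual analysis code):

```python
from itertools import combinations

import numpy as np

def percent_agreement(ratings):
    """Mean pairwise percent agreement for one item.

    ratings: (n_raters, n_videos) array of scores on that item.
    Returns a value on the 0-100 scale.
    """
    ratings = np.asarray(ratings)
    pairs = list(combinations(range(ratings.shape[0]), 2))
    # Fraction of videos on which each pair of raters gave identical scores
    agree = [np.mean(ratings[i] == ratings[j]) for i, j in pairs]
    return 100 * np.mean(agree)
```

With two raters this reduces to the simple fraction of identically scored videos; with more raters it averages over all rater pairs.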

2020 ◽  
Vol 80 (4) ◽  
pp. 808-820
Author(s):  
Cindy M. Walker ◽  
Sakine Göçer Şahin

The purpose of this study was to investigate a new way of evaluating interrater reliability that can determine whether two raters differ in their ratings on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three procedures that can serve as measures of interrater reliability were compared: (1) the intraclass correlation coefficient (ICC), (2) Cohen's kappa statistic, and (3) the DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative for assessing the interrater reliability of constructed response items and other polytomous item types, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine whether a rater is more severe or more lenient in scoring each individual polytomous item on a test or rating scale.
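The two traditional measures compared here have compact closed forms. A rough sketch of both, assuming two raters for kappa and a fully crossed subjects × raters score matrix for the ICC; the ICC(2,1) form (two-way random effects, absolute agreement, single rater) is one common variant, and the abstract does not specify which form the study used:

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    po = np.mean(r1 == r2)  # observed agreement
    # Chance agreement: product of each rater's marginal category rates
    pe = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (po - pe) / (1 - pe)

def icc2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x: (n_subjects, k_raters) matrix of scores.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares for rows (subjects), columns (raters), and residual
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note the contrast with the DIF approach described above: both of these statistics assume the raters scored a common set of items, which is exactly the fully crossed design requirement that Poly-SIBTEST relaxes.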


2018 ◽  
Vol 25 (3) ◽  
pp. 286-290 ◽  
Author(s):  
Elif Bilgic ◽  
Madoka Takao ◽  
Pepa Kaneva ◽  
Satoshi Endo ◽  
Toshitatsu Takao ◽  
...  

Background. A needs assessment identified a gap in the laparoscopic suturing skills targeted in simulation. This study collected validity evidence for an advanced laparoscopic suturing task using an Endo Stitch™ device. Methods. Experienced surgeons (ES) and novice surgeons (NS) performed continuous suturing after watching an instructional video. Scores were based on time and accuracy and on the Global Operative Assessment of Laparoscopic Surgery. Data are shown as medians [25th-75th percentiles] (ES vs NS). Interrater reliability was calculated using intraclass correlation coefficients (confidence interval). Results. Seventeen participants were enrolled. Experienced surgeons had significantly higher task scores (980 [964-999] vs 666 [391-711], P = .0035) and Global Operative Assessment of Laparoscopic Surgery scores (25 [24-25] vs 14 [12-17], P = .0029). Interrater reliability for time and accuracy was 1.0 and 0.9 (0.74-0.96), respectively. All experienced surgeons agreed that the task was relevant to practice. Conclusion. This study provides validity evidence for the task as a measure of laparoscopic suturing skill using an automated suturing device. It could help trainees acquire the skills they need to better prepare for clinical learning.


Author(s):  
Linye Jing ◽  
Maria I. Grigos

Purpose: Forming accurate and consistent speech judgments can be challenging when working with children with speech sound disorders who produce a large number and varied types of error patterns. Rating scales offer a systematic approach to assessing the whole word rather than individual sounds. Thus, these scales can be an efficient way for speech-language pathologists (SLPs) to monitor treatment progress. This study evaluated the interrater reliability of an existing 3-point rating scale using a large group of SLPs as raters. Method: Using an online platform, 30 SLPs completed a brief training and then rated single words produced by children with typical speech patterns and children with speech sound disorders. Words were closely balanced across the three rating categories of the scale. The interrater reliability of the SLPs' ratings against a consensus judgment was examined. Results: The majority of SLPs (87%) reached substantial interrater reliability against the consensus judgment using the 3-point rating scale. Correct productions had the highest interrater reliability. Productions with extensive errors had higher agreement than those with minor errors. Certain error types, such as vowel distortions, were especially challenging for SLPs to judge. Conclusions: This study demonstrated substantial interrater reliability against a consensus judgment among a large majority of 30 SLPs using a 3-point rating scale. The clinical implications of the findings are discussed along with proposed modifications to the training procedure to guide future research.


2019 ◽  
Vol 5 (1) ◽  
pp. e000541 ◽  
Author(s):  
John Ressman ◽  
Wilhelmus Johannes Andreas Grooten ◽  
Eva Rasmussen Barr

Single leg squat (SLS) is a common tool used in clinical examination to set and evaluate rehabilitation goals, but also to assess lower extremity function in active people. Objectives: To conduct a review and meta-analysis on the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests. Design: Review with meta-analysis. Data sources: CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science were searched up until December 2018. Eligibility criteria: Studies were eligible for inclusion if they were methodological studies that assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality. Results: Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most of the studies reached 'moderate' measures of agreement. The pooled ICC/kappa showed 'moderate' agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and 'substantial' agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability of ≤3-point rating scales, while no difference was found for different numbers of segmental assessments. Conclusion: Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution. PROSPERO registration number: CRD42018077822.
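Pooled agreement coefficients of this kind are often obtained by transforming each study's ICC or kappa to Fisher's z, weighting by inverse variance, and back-transforming. The sketch below shows the fixed-effect version under the usual var(z) ≈ 1/(n − 3) approximation; it is a generic illustration, and the review's actual pooling model may differ (e.g., a random-effects model):

```python
import numpy as np

def pool_correlations(r, n):
    """Fixed-effect pooling of correlation-type coefficients (e.g. ICCs)
    via Fisher's z transform with inverse-variance weights.

    r: per-study coefficients; n: per-study sample sizes.
    Returns (pooled coefficient, (95% CI lower, 95% CI upper)).
    """
    r, n = np.asarray(r, float), np.asarray(n, float)
    z = np.arctanh(r)   # Fisher z transform
    w = n - 3           # inverse of the approximate variance 1/(n - 3)
    z_pooled = (w * z).sum() / w.sum()
    se = 1 / np.sqrt(w.sum())
    lo, hi = z_pooled - 1.96 * se, z_pooled + 1.96 * se
    return np.tanh(z_pooled), (np.tanh(lo), np.tanh(hi))
```

The back-transform keeps the pooled estimate and its confidence limits inside the valid (−1, 1) range, which a naive weighted mean of raw coefficients would not guarantee for the limits.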


2013 ◽  
Vol 38 (1) ◽  
pp. 31-37 ◽  
Author(s):  
AK Kolb ◽  
K Schmied ◽  
P Faßheber ◽  
R Heinrich-Weltzien

Objective: The aim of this video-based study was to examine the taste acceptance of highly concentrated fluoride preparations among children aged 2 to 5 years in kindergarten-based preventive programs. Study design: The fluoride preparation Duraphat was applied to 16 children, Elmex fluid to 15 children, and Fluoridin N5 to 14 children. The procedure was conducted according to a standardized protocol and videotaped. Three raters evaluated the children's nonverbal behavior as a measure of taste acceptance on the Frankl Behavior Rating Scale; the interrater reliability (intraclass correlation coefficient; ICC) was .86. In an interview, children rated the taste of the fluoride preparations on a three-point “smiley” rating scale. The interviewer used a hand puppet during the survey to establish confidence between the children and examiners. Results: Children's nonverbal behavior was significantly more positive after Fluoridin N5 and Duraphat were applied than after the application of Elmex fluid. The same trend was found in the smiley assessment. The response of children who displayed cooperative, positive behavior before the application of fluoride preparations was significantly more positive than that of children who displayed uncooperative, negative behavior. Conclusion: To achieve high acceptance of fluoride preparations among preschool children, flavorful preparations should be used.


1969 ◽  
Vol 25 (2) ◽  
pp. 399-406
Author(s):  
S. Thomas Friedman ◽  
Richard F. Purnell ◽  
Edward E. Gotts

The purpose was to use adult participant observers to create a scale for assessing some salient personality variables of children and young adolescents living together in close quarters. The 91 children were summer campers of both sexes (8 to 15 yr.). Counselors of these children were the adult participant observers. At least two counselors rated each camper on a 49-item rating scale. Interrater reliability was determined and composite ratings of the campers were factor analyzed. Seven factors accounted for the behaviors on the rating scales. These factors were consistent with and comparable to the constructs that were introduced into the items on the rating scale, e.g., Peer Orientation, Ego Strength, Interaction Potential, Adult Orientation, Rebelliousness, and Rigidity.


2008 ◽  
Vol 192 (1) ◽  
pp. 52-58 ◽  
Author(s):  
Janet B. W. Williams ◽  
Kenneth A. Kobak

Background: The Montgomery-Åsberg Depression Rating Scale (MADRS) is often used in clinical trials to select patients and to assess treatment efficacy. The scale was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the items. Structured and semi-structured interview guides have been found to improve reliability with other scales. Aims: To describe the development and test-retest reliability of a structured interview guide for the MADRS (SIGMA). Method: A total of 162 test-retest interviews were conducted by 81 rater pairs. Each patient was interviewed twice, once by each rater conducting an independent interview. Results: The intraclass correlation for total score between raters using the SIGMA was r = 0.93, P < 0.0001. All ten items had good to excellent interrater reliability. Conclusions: Use of the SIGMA can result in high reliability of MADRS scores in evaluating patients with depression.


2002 ◽  
Vol 180 (1) ◽  
pp. 45-50 ◽  
Author(s):  
Peter F. Liddle ◽  
Elton T. C. Ngan ◽  
Gary Duffield ◽  
King Kho ◽  
Anthony J. Warren

Background: In the rating scales commonly used for assessing response to antipsychotic treatment, individual items embrace symptoms that apparently arise from distinguishable pathophysiological processes and might be expected to respond differently to treatment. Aims: To test the reliability, sensitivity to change, and factor structure of a new scale for the assessment of the Signs and Symptoms of Psychotic Illness (the SSPI). Method: Interrater reliability was evaluated by determining the intraclass correlation for the ratings of 63 patients. Sensitivity to change was assessed in a longitudinal study of 33 patients. Factor structure was determined from scores for 155 patients. Results: The intraclass correlation was satisfactory for all individual items and excellent for the total score. Scores were sensitive to change. A change in Clinical Global Impression of one unit corresponded to an SSPI total score change of 31%. Factor analysis revealed five clusters of symptoms. Conclusions: The SSPI provides a sensitive and reliable measure of the five major clusters of symptoms that occur commonly in psychotic illness.


1998 ◽  
Vol 18 (4) ◽  
pp. 193-206 ◽  
Author(s):  
Lena Haglund ◽  
Lars-Hakan Thorell ◽  
Jan Walinder

A Swedish version of the Occupational Case Analysis Interview and Rating Scale (OCAIRS-S) has been tested earlier for interrater reliability. The present study, using the second version of the OCAIRS-S and a sample of 145 patients, showed interrater correlations between .88 and .96 (intraclass correlation coefficients). The results indicate that the OCAIRS-S predicts which patients should be included in and excluded from occupational therapy and identifies patients who should be observed further before such decisions are made. The study indicates a need for further investigation of which components of the OCAIRS-S influence the occupational therapist in judging the patient's need for occupational therapy.


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Rocio García-Ramos ◽  
Clara Villanueva Iza ◽  
María José Catalán ◽  
Abilio Reig-Ferrer ◽  
Jorge Matías-Guíu

Introduction: To date, no rating scales for detecting apathy in Parkinson's disease (PD) patients have been validated in Spanish. For this reason, the aim of this study was to validate a Spanish version of the Lille apathy rating scale (LARS) in a cohort of PD patients from Spain. Participants and Methods: 130 PD patients and 70 healthy controls were recruited to participate in the study. Apathy was measured using the Spanish version of the LARS and the neuropsychiatric inventory (NPI). Reliability (internal consistency, test-retest, and interrater reliability) and validity (construct, content, and criterion validity) were measured. Results: Interrater reliability was 0.93. Cronbach's α for the LARS was 0.81. The test-retest correlation coefficient was 0.97. The correlation between LARS and NPI scores was 0.61. The optimal cutoff point under the ROC curve was −14, whereas the value derived from healthy controls was −11. The prevalence of apathy in our population as measured by the LARS was 42%. Conclusions: The Spanish version of the LARS is a reliable and useful tool for diagnosing apathy in PD patients. Total LARS score is influenced by the presence of depression and cognitive impairment; however, both disorders are independent entities with respect to apathy. The satisfactory reliability and validity of the scale make it an appropriate instrument for screening and diagnosing apathy in clinical practice or for research purposes.
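An "optimal cutoff point under the ROC curve" of this kind is commonly chosen by maximising Youden's J (sensitivity + specificity − 1). A hypothetical sketch of that selection step, assuming higher LARS scores indicate more apathy and labels code 1 = apathetic, 0 = control (the study does not state that it used this exact criterion):

```python
import numpy as np

def youden_cutoff(scores, labels):
    """Cutoff maximising Youden's J = sensitivity + specificity - 1.

    scores: test scores, higher = more likely a case.
    labels: 1 = case, 0 = control.
    Returns (best cutoff, J at that cutoff); scores >= cutoff are
    classified as cases.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_j, best_c = -1.0, None
    for c in np.unique(scores):  # candidate cutoffs: observed scores
        pred = scores >= c
        sens = (pred & (labels == 1)).sum() / (labels == 1).sum()
        spec = (~pred & (labels == 0)).sum() / (labels == 0).sum()
        j = sens + spec - 1
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j
```

Scanning only the observed scores is sufficient because sensitivity and specificity change only at those values.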

