Preliminary Validity Evidence for a Milestones-Based Rating Scale for Chart-Stimulated Recall

2018 ◽  
Vol 10 (3) ◽  
pp. 269-275 ◽  
Author(s):  
Shalini T. Reddy ◽  
Ara Tekian ◽  
Steven J. Durning ◽  
Shanu Gupta ◽  
Justin Endo ◽  
...  

ABSTRACT Background: Minimally anchored Standard Rating Scales (SRSs), which are widely used in medical education, are hampered by suboptimal interrater reliability. Expert-derived frameworks, such as the Accreditation Council for Graduate Medical Education (ACGME) Milestones, may help define level-specific anchors for rating scales. Objective: We examined validity evidence for a Milestones-Based Rating Scale (MBRS) for scoring chart-stimulated recall (CSR). Methods: Two 11-item scoring forms, one with an MBRS and one with an SRS, were developed. Items and anchors for the MBRS were adapted from the ACGME Internal Medicine Milestones. Six standardized CSR videos were developed. Clinical faculty scored the videos using either the MBRS or the SRS, following a randomized crossover design. Reliability of the MBRS versus the SRS was compared using intraclass correlation. Results: Twenty-two faculty were recruited for instrument testing. Some participants did not complete scoring, leaving a response rate of 15 faculty (7 in the MBRS group and 8 in the SRS group). A total of 529 ratings (number of items × number of scores) using SRSs and 540 using MBRSs were available. Percent agreement was higher for MBRSs for only 2 of 11 items: use of consultants (92 versus 75, P = .019) and unique characteristics of patients (96 versus 79, P = .011); it was also higher for the overall score (89 versus 82, P < .001). Interrater agreement was 0.61 for MBRSs and 0.51 for SRSs. Conclusions: Adding milestones to our rating form produced a significant, but not substantial, improvement in the intraclass correlation coefficient, and the improvement was inconsistent across items.
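The item-level percent agreement reported above can be computed as the mean pairwise agreement across raters. A minimal sketch, assuming a fully crossed raters × videos score matrix per item (the function name and array layout are illustrative, not the study's actual analysis code):

```python
from itertools import combinations

import numpy as np

def percent_agreement(ratings):
    """Mean pairwise percent agreement for one item.

    ratings: (n_raters, n_videos) array of scores on that item.
    Returns a value on the 0-100 scale.
    """
    ratings = np.asarray(ratings)
    pairs = list(combinations(range(ratings.shape[0]), 2))
    # Fraction of videos on which each pair of raters gave identical scores
    agree = [np.mean(ratings[i] == ratings[j]) for i, j in pairs]
    return 100 * np.mean(agree)
```

With two raters this reduces to the simple fraction of identically scored videos; with more raters it averages over all rater pairs.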

2020 ◽  
Vol 80 (4) ◽  
pp. 808-820
Author(s):  
Cindy M. Walker ◽  
Sakine Göçer Şahin

The purpose of this study was to investigate a new way of evaluating interrater reliability that can determine whether two raters differ in their ratings on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three procedures that can serve as measures of interrater reliability were compared: (1) the intraclass correlation coefficient (ICC), (2) Cohen's kappa statistic, and (3) the DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative for assessing the interrater reliability of constructed response items and other polytomous item types, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine whether a rater is more severe or more lenient in scoring each individual polytomous item on a test or rating scale.
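The two traditional measures compared here have compact closed forms. A rough sketch of both, assuming two raters for kappa and a fully crossed subjects × raters score matrix for the ICC; the ICC(2,1) form (two-way random effects, absolute agreement, single rater) is one common variant, and the abstract does not specify which form the study used:

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    po = np.mean(r1 == r2)  # observed agreement
    # Chance agreement: product of each rater's marginal category rates
    pe = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (po - pe) / (1 - pe)

def icc2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x: (n_subjects, k_raters) matrix of scores.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares for rows (subjects), columns (raters), and residual
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note the contrast with the DIF approach described above: both of these statistics assume the raters scored a common set of items, which is exactly the fully crossed design requirement that Poly-SIBTEST relaxes.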


2018 ◽  
Vol 25 (3) ◽  
pp. 286-290 ◽  
Author(s):  
Elif Bilgic ◽  
Madoka Takao ◽  
Pepa Kaneva ◽  
Satoshi Endo ◽  
Toshitatsu Takao ◽  
...  

Background. A needs assessment identified a gap in the laparoscopic suturing skills targeted in simulation. This study collected validity evidence for an advanced laparoscopic suturing task using an Endo Stitch™ device. Methods. Experienced surgeons (ES) and novice surgeons (NS) performed continuous suturing after watching an instructional video. Scores were based on time and accuracy and on the Global Operative Assessment of Laparoscopic Surgery. Data are shown as medians [25th-75th percentiles] (ES vs NS). Interrater reliability was calculated using intraclass correlation coefficients (confidence interval). Results. Seventeen participants were enrolled. Experienced surgeons had significantly higher task scores (980 [964-999] vs 666 [391-711], P = .0035) and Global Operative Assessment of Laparoscopic Surgery scores (25 [24-25] vs 14 [12-17], P = .0029). Interrater reliability for time and accuracy was 1.0 and 0.9 (0.74-0.96), respectively. All experienced surgeons agreed that the task was relevant to practice. Conclusion. This study provides validity evidence for the task as a measure of laparoscopic suturing skill using an automated suturing device. It could help trainees acquire the skills they need to better prepare for clinical learning.


Author(s):  
Linye Jing ◽  
Maria I. Grigos

Purpose: Forming accurate and consistent speech judgments can be challenging when working with children with speech sound disorders who produce a large number and varied types of error patterns. Rating scales offer a systematic approach to assessing the whole word rather than individual sounds. Thus, these scales can be an efficient way for speech-language pathologists (SLPs) to monitor treatment progress. This study evaluated the interrater reliability of an existing 3-point rating scale using a large group of SLPs as raters. Method: Using an online platform, 30 SLPs completed a brief training and then rated single words produced by children with typical speech patterns and children with speech sound disorders. Words were closely balanced across the three rating categories of the scale. The interrater reliability of the SLPs' ratings against a consensus judgment was examined. Results: The majority of SLPs (87%) reached substantial interrater reliability against the consensus judgment using the 3-point rating scale. Correct productions had the highest interrater reliability. Productions with extensive errors had higher agreement than those with minor errors. Certain error types, such as vowel distortions, were especially challenging for SLPs to judge. Conclusions: This study demonstrated substantial interrater reliability against a consensus judgment among a large majority of 30 SLPs using a 3-point rating scale. The clinical implications of the findings are discussed along with proposed modifications to the training procedure to guide future research.


2019 ◽  
Vol 5 (1) ◽  
pp. e000541 ◽  
Author(s):  
John Ressman ◽  
Wilhelmus Johannes Andreas Grooten ◽  
Eva Rasmussen Barr

Single leg squat (SLS) is a common tool used in clinical examination to set and evaluate rehabilitation goals, but also to assess lower extremity function in active people. Objectives: To conduct a review and meta-analysis on the inter-rater and intrarater reliability of the SLS, including the lateral step-down (LSD) and forward step-down (FSD) tests. Design: Review with meta-analysis. Data sources: CINAHL, Cochrane Library, Embase, Medline (OVID) and Web of Science were searched up until December 2018. Eligibility criteria: Studies were eligible for inclusion if they were methodological studies that assessed the inter-rater and/or intrarater reliability of the SLS, FSD and LSD through observation of movement quality. Results: Thirty-one studies were included. Reliability varied largely between studies (inter-rater: kappa/intraclass correlation coefficients (ICC) = 0.00–0.95; intrarater: kappa/ICC = 0.13–1.00), but most of the studies reached 'moderate' measures of agreement. The pooled ICC/kappa showed 'moderate' agreement for inter-rater reliability, 0.58 (95% CI 0.50 to 0.65), and 'substantial' agreement for intrarater reliability, 0.68 (95% CI 0.60 to 0.74). Subgroup analyses showed higher pooled agreement for inter-rater reliability of ≤3-point rating scales, while no difference was found for different numbers of segmental assessments. Conclusion: Our findings indicate that the SLS test, including the FSD and LSD tests, can be suitable for clinical use regardless of the number of observed segments, particularly with a ≤3-point rating scale. Since most of the included studies were affected by some form of methodological bias, our findings must be interpreted with caution. PROSPERO registration number: CRD42018077822.
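Pooled agreement coefficients of this kind are often obtained by transforming each study's ICC or kappa to Fisher's z, weighting by inverse variance, and back-transforming. The sketch below shows the fixed-effect version under the usual var(z) ≈ 1/(n − 3) approximation; it is a generic illustration, and the review's actual pooling model may differ (e.g., a random-effects model):

```python
import numpy as np

def pool_correlations(r, n):
    """Fixed-effect pooling of correlation-type coefficients (e.g. ICCs)
    via Fisher's z transform with inverse-variance weights.

    r: per-study coefficients; n: per-study sample sizes.
    Returns (pooled coefficient, (95% CI lower, 95% CI upper)).
    """
    r, n = np.asarray(r, float), np.asarray(n, float)
    z = np.arctanh(r)   # Fisher z transform
    w = n - 3           # inverse of the approximate variance 1/(n - 3)
    z_pooled = (w * z).sum() / w.sum()
    se = 1 / np.sqrt(w.sum())
    lo, hi = z_pooled - 1.96 * se, z_pooled + 1.96 * se
    return np.tanh(z_pooled), (np.tanh(lo), np.tanh(hi))
```

The back-transform keeps the pooled estimate and its confidence limits inside the valid (−1, 1) range, which a naive weighted mean of raw coefficients would not guarantee for the limits.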


2013 ◽  
Vol 38 (1) ◽  
pp. 31-37 ◽  
Author(s):  
AK Kolb ◽  
K Schmied ◽  
P Faßheber ◽  
R Heinrich-Weltzien

Objective: The aim of this video-based study was to examine the taste acceptance of highly concentrated fluoride preparations among children aged 2 to 5 years in kindergarten-based preventive programs. Study design: The fluoride preparation Duraphat was applied to 16 children, Elmex fluid to 15 children, and Fluoridin N5 to 14 children. The procedure was conducted according to a standardized protocol and videotaped. Three raters evaluated the children's nonverbal behavior as a measure of taste acceptance on the Frankl Behavior Rating Scale; the interrater reliability (intraclass correlation coefficient; ICC) was .86. In an interview, children rated the taste of the fluoride preparations on a three-point “smiley” rating scale. The interviewer used a hand puppet during the survey to establish confidence between the children and examiners. Results: Children's nonverbal behavior was significantly more positive after Fluoridin N5 and Duraphat were applied than after the application of Elmex fluid. The same trend was found in the smiley assessment. The response of children who displayed cooperative, positive behavior before the application of fluoride preparations was significantly more positive than that of children who displayed uncooperative, negative behavior. Conclusion: To achieve high acceptance of fluoride preparations among preschool children, flavorful preparations should be used.


1969 ◽  
Vol 25 (2) ◽  
pp. 399-406
Author(s):  
S. Thomas Friedman ◽  
Richard F. Purnell ◽  
Edward E. Gotts

The purpose was to use adult participant observers to create a scale for assessing some salient personality variables of children and young adolescents living together in close quarters. The 91 children were summer campers of both sexes (8 to 15 yr.). Counselors of these children were the adult participant observers. At least two counselors rated each camper on a 49-item rating scale. Interrater reliability was determined and composite ratings of the campers were factor analyzed. Seven factors accounted for the behaviors on the rating scales. These factors were consistent with and comparable to the constructs that were introduced into the items on the rating scale, e.g., Peer Orientation, Ego Strength, Interaction Potential, Adult Orientation, Rebelliousness, and Rigidity.


2008 ◽  
Vol 192 (1) ◽  
pp. 52-58 ◽  
Author(s):  
Janet B. W. Williams ◽  
Kenneth A. Kobak

Background: The Montgomery-Åsberg Depression Rating Scale (MADRS) is often used in clinical trials to select patients and to assess treatment efficacy. The scale was originally published without suggested questions for clinicians to use in gathering the information necessary to rate the items. Structured and semi-structured interview guides have been found to improve reliability with other scales. Aims: To describe the development and test-retest reliability of a structured interview guide for the MADRS (SIGMA). Method: A total of 162 test-retest interviews were conducted by 81 rater pairs. Each patient was interviewed twice, once by each rater conducting an independent interview. Results: The intraclass correlation for total score between raters using the SIGMA was r = 0.93, P < 0.0001. All ten items had good to excellent interrater reliability. Conclusions: Use of the SIGMA can result in high reliability of MADRS scores in evaluating patients with depression.


2002 ◽  
Vol 180 (1) ◽  
pp. 45-50 ◽  
Author(s):  
Peter F. Liddle ◽  
Elton T. C. Ngan ◽  
Gary Duffield ◽  
King Kho ◽  
Anthony J. Warren

Background: In the rating scales commonly used for assessing response to antipsychotic treatment, individual items embrace symptoms that apparently arise from distinguishable pathophysiological processes and might be expected to respond differently to treatment. Aims: To test the reliability, sensitivity to change, and factor structure of a new scale for the assessment of the Signs and Symptoms of Psychotic Illness (the SSPI). Method: Interrater reliability was evaluated by determining the intraclass correlation for the ratings of 63 patients. Sensitivity to change was assessed in a longitudinal study of 33 patients. Factor structure was determined from scores for 155 patients. Results: The intraclass correlation was satisfactory for all individual items and excellent for the total score. Scores were sensitive to change. A change in Clinical Global Impression of one unit corresponded to an SSPI total score change of 31%. Factor analysis revealed five clusters of symptoms. Conclusions: The SSPI provides a sensitive and reliable measure of the five major clusters of symptoms that occur commonly in psychotic illness.


1998 ◽  
Vol 18 (4) ◽  
pp. 193-206 ◽  
Author(s):  
Lena Haglund ◽  
Lars-Hakan Thorell ◽  
Jan Walinder

A Swedish version of the Occupational Case Analysis Interview and Rating Scale (OCAIRS-S) has been tested earlier for interrater reliability. The present study, using the second version of the OCAIRS-S and a sample of 145 patients, showed interrater correlations between .88 and .96 (intraclass correlation coefficients). The results indicate that the OCAIRS-S predicts which patients should be included in and excluded from occupational therapy and identifies patients who should be observed further before such decisions are made. The study indicates a need for further investigation of which components of the OCAIRS-S influence the occupational therapist in judging the patient's need for occupational therapy.


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Rocio García-Ramos ◽  
Clara Villanueva Iza ◽  
María José Catalán ◽  
Abilio Reig-Ferrer ◽  
Jorge Matías-Guíu

Introduction: To date, no rating scales for detecting apathy in Parkinson's disease (PD) patients have been validated in Spanish. For this reason, the aim of this study was to validate a Spanish version of the Lille apathy rating scale (LARS) in a cohort of PD patients from Spain. Participants and Methods: 130 PD patients and 70 healthy controls were recruited to participate in the study. Apathy was measured using the Spanish version of the LARS and the neuropsychiatric inventory (NPI). Reliability (internal consistency, test-retest, and interrater reliability) and validity (construct, content, and criterion validity) were measured. Results: Interrater reliability was 0.93. Cronbach's α for the LARS was 0.81. The test-retest correlation coefficient was 0.97. The correlation between LARS and NPI scores was 0.61. The optimal cutoff point under the ROC curve was −14, whereas the value derived from healthy controls was −11. The prevalence of apathy in our population as measured by the LARS was 42%. Conclusions: The Spanish version of the LARS is a reliable and useful tool for diagnosing apathy in PD patients. Total LARS score is influenced by the presence of depression and cognitive impairment; however, both disorders are independent entities with respect to apathy. The satisfactory reliability and validity of the scale make it an appropriate instrument for screening and diagnosing apathy in clinical practice or for research purposes.
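An "optimal cutoff point under the ROC curve" of this kind is commonly chosen by maximising Youden's J (sensitivity + specificity − 1). A hypothetical sketch of that selection step, assuming higher LARS scores indicate more apathy and labels code 1 = apathetic, 0 = control (the study does not state that it used this exact criterion):

```python
import numpy as np

def youden_cutoff(scores, labels):
    """Cutoff maximising Youden's J = sensitivity + specificity - 1.

    scores: test scores, higher = more likely a case.
    labels: 1 = case, 0 = control.
    Returns (best cutoff, J at that cutoff); scores >= cutoff are
    classified as cases.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_j, best_c = -1.0, None
    for c in np.unique(scores):  # candidate cutoffs: observed scores
        pred = scores >= c
        sens = (pred & (labels == 1)).sum() / (labels == 1).sum()
        spec = (~pred & (labels == 0)).sum() / (labels == 0).sum()
        j = sens + spec - 1
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j
```

Scanning only the observed scores is sufficient because sensitivity and specificity change only at those values.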

