On the Usefulness of Interrater Reliability Coefficients

Author(s):
Debby ten Hove
Terrence D. Jorgensen
L. Andries van der Ark


2002
Vol 18 (1)
pp. 52-62
Author(s):
Olga F. Voskuijl
Tjarda van Sliedregt

Summary: This paper presents a meta-analysis of published job analysis interrater reliability data in order to predict the expected levels of interrater reliability within specific combinations of moderators, such as rater source, experience of the rater, and type of job descriptive information. The overall mean of the 91 interrater reliability coefficients reported in the literature was .59. Experienced professionals (job analysts) showed the highest reliability coefficients (.76). The method of data collection (job contact versus job description) affected only the results of experienced job analysts: for this group, higher interrater reliability coefficients were obtained for analyses based on job contact (.87) than for those based on job descriptions (.71). For other rater categories (e.g., students, organization members), neither the method of data collection nor training had a significant effect on interrater reliability. Analyses based on scales with defined levels resulted in significantly higher interrater reliability coefficients than analyses based on scales with undefined levels. Behavior and job worth dimensions were rated more reliably (.62 and .60, respectively) than attributes and tasks (.49 and .29, respectively). Furthermore, the results indicated that if nonprofessional raters are used (e.g., incumbents or students), at least two to four raters are required to obtain a reliability coefficient of .80. These findings have implications for research and practice.
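A projection of this kind is typically obtained with the Spearman-Brown prophecy formula, which gives the reliability of the average of k raters from a single-rater coefficient. A minimal sketch, assuming single-rater values in the range reported above (illustrative only, not the authors' computation):

```python
import math

def spearman_brown(single_rater_rel: float, k: int) -> float:
    """Reliability of the mean of k raters, given the single-rater reliability."""
    return (k * single_rater_rel) / (1 + (k - 1) * single_rater_rel)

def raters_needed(single_rater_rel: float, target: float = 0.80) -> int:
    """Smallest number of raters whose averaged ratings reach the target reliability."""
    k = (target * (1 - single_rater_rel)) / (single_rater_rel * (1 - target))
    return math.ceil(k)

# Illustrative single-rater coefficients in the range reported for nonprofessional raters
for rel in (0.55, 0.59, 0.62):
    print(rel, raters_needed(rel))  # three to four raters for these values
```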


1987
Vol 61 (3)
pp. 1009-1010
Author(s):  
Mark A. Runco

The reliability and true variance of a socially valid measure of creativity were assessed by asking three judges to rate the creativity of 29 adolescents. Interitem reliability was .93; interrater reliability was .48; and true score variance, estimated from the interitem and interrater reliability coefficients, was .65.
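Interrater reliability for a small panel of judges is often summarized as the average correlation between judges' ratings (or an intraclass correlation). A minimal sketch with simulated ratings for three judges and 29 adolescents, not Runco's actual data or procedure:

```python
import numpy as np
from itertools import combinations

# Hypothetical creativity ratings: rows = 29 adolescents, columns = 3 judges
rng = np.random.default_rng(0)
true_creativity = rng.normal(size=29)
ratings = np.column_stack(
    [true_creativity + rng.normal(scale=1.0, size=29) for _ in range(3)]
)

# One simple interrater index: the mean Pearson correlation over judge pairs
pairwise = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
            for i, j in combinations(range(3), 2)]
print("mean pairwise interrater correlation:", np.mean(pairwise))
```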


1992
Vol 74 (2)
pp. 347-353
Author(s):  
Elizabeth M. Mason

The purpose of this study was to investigate the interrater reliability of the visual-motor portion of the Copying subtest of the Stanford-Binet Intelligence Scale: Fourth Edition. Eight raters independently scored 11 protocols completed by children aged 5 through 10 years, using the scoring criteria and guidelines in the manual. The raters marked each of 10 items pass or fail and computed a total raw score for each protocol. Interrater reliability coefficients were obtained for each child's protocol, and the kappa coefficient was computed for each item. The statistically significant interrater reliability coefficients ranged from .82 to .91, which were low in comparison to the test-retest reliability and Kuder-Richardson 20 coefficients reported in the technical manual for this and other Stanford-Binet subtests. Percent agreement among the 8 raters also indicated weak reliability. Although some of the obtained interrater reliability coefficients were within acceptable levels, questions were raised about the scoring criteria for individual items. Caution is warranted in the use of cognitive measures whose scoring criteria require subjective judgment by the examiner.
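For a dichotomous pass/fail item, Cohen's kappa corrects raw percent agreement for the agreement expected from each rater's marginal frequencies; with eight raters, per-item coefficients would aggregate over rater pairs or use a multi-rater extension such as Fleiss' kappa. A minimal sketch for one hypothetical pair of raters scoring 11 protocols, not the study's data:

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical judgments of the same protocols."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    m1, m2 = Counter(r1), Counter(r2)
    expected = sum(m1[c] * m2[c] for c in set(r1) | set(r2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass (1) / fail (0) scores on one Copying item for 11 protocols
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]
rater_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]
print(cohen_kappa(rater_a, rater_b))
```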


1993
Vol 77 (3_suppl)
pp. 1215-1218
Author(s):
Susan Dickerson Mayes
Edward O. Bixler

Agreement between raters using global impressions to assess methylphenidate response was analyzed for children with Attention-Deficit Hyperactivity Disorder (ADHD) undergoing double-blind, placebo-controlled, crossover methylphenidate trials. Caregivers were more likely to disagree than agree when asked to rate the children as “better, same, or worse” during each day of the trial. Overall agreement was 42.9%, only 9.6% above what would be expected based on chance alone. Further, none of the interrater reliability coefficients (Cohen's kappa) for the individual children were statistically significant.
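The "9.6% above chance" figure is consistent with comparing observed agreement to the one-in-three agreement expected if "better, same, or worse" were a uniform guess; Cohen's kappa applies the same chance-correction logic but estimates chance agreement from each rater's marginal frequencies. A small hedged illustration using only the reported overall agreement:

```python
# Reported overall agreement between caregivers on "better / same / worse"
observed = 0.429

# A uniform-guessing baseline over three categories (an assumption for illustration)
chance_uniform = 1 / 3
print(observed - chance_uniform)  # about 0.096, i.e., 9.6% above chance

# The same chance-correction logic expressed as a kappa-style coefficient
print((observed - chance_uniform) / (1 - chance_uniform))  # about 0.14
```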


2017
Vol 25 (0)
Author(s):
Maria Alzete de Lima
Lorita Marlena Freitag Pagliuca
Jennara Cândido do Nascimento
Joselany Áfio Caetano

Objective: to compare the interrater reliability of two eye assessment methods. Method: a quasi-experimental study conducted with 324 college students at a public university, comprising an eye self-examination and an eye assessment performed by the researchers. The kappa coefficient was used to verify agreement. Results: interrater reliability coefficients ranged from 0.85 to 0.95, with statistical significance at 0.05. The examinations of near acuity and peripheral vision presented a reasonable kappa (>0.2); the remaining coefficients were higher, ranging from very reliable to totally reliable. Conclusion: the results of the two methods were comparable. The virtual manual on eye self-examination can be used to screen for eye conditions.
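When each method yields a categorical outcome per participant (e.g., an eye alteration flagged or not flagged), agreement between the two methods can be computed with an off-the-shelf kappa. A minimal sketch using scikit-learn and hypothetical labels, not the authors' data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical screening outcomes for ten students: 1 = alteration detected, 0 = none
self_exam       = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
researcher_exam = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

print(cohen_kappa_score(self_exam, researcher_exam))
```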


2020
pp. 019874292096915
Author(s):
Jacqueline Huscroft-D’Angelo
Jessica Wery
Jodie Diane Martin
Corey Pierce
Lindy Crawford

The Scales for Assessing Emotional Disturbance—Third Edition Rating Scale (SAED-3 RS; Epstein et al.) is a standardized, norm-referenced measure designed to aid the identification process by providing useful data to professionals determining the eligibility of students with an emotional disturbance (ED). Three studies are reported to address the reliability of the SAED-3 RS. Study 1 investigated the internal reliability of the SAED-3 RS using data from a nationally representative sample of 1,430 students and 441 students with ED. Study 2 examined interrater reliability between 123 pairs of educators who had worked with the same student for at least 2 months. Study 3 assessed test–retest reliability over a 2-week period to determine the stability of the SAED-3 RS. Across all studies, scores from the SAED-3 RS were determined to be reliable and stable for measuring the emotional and behavioral functioning of students. Specifically, the averaged coefficient alpha for internal consistency ranged from .79 to .92 across subscales, with .96 for the composite score; interrater reliability coefficients ranged from .77 to .89 across subscales, with .89 for the composite score; and test–retest reliability coefficients ranged from .79 to .92 across subscales, with .96 for the composite score. Limitations, future research, and implications for use of the SAED-3 RS are discussed.
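Coefficient alpha for a subscale can be computed from the item-score variances and the total-score variance. A minimal sketch with simulated item responses (the item format, sample size, and data are hypothetical, not the SAED-3 RS norming data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: rows are respondents, columns are items of one subscale."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

# Simulated 0-3 ratings for 200 respondents on a 7-item subscale
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
items = np.clip(np.round(1.5 + latent + rng.normal(scale=0.7, size=(200, 7))), 0, 3)
print(cronbach_alpha(items))
```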


1970
Vol 26 (2)
pp. 451-456
Author(s):  
James J. Asher

Because the evidence has shown that the usual selection interview is unreliable, an alternative strategy, the Q by Q Interview, was developed. The Q by Q format requires the interviewer to make a series of selection decisions rather than waiting until the end of the interview to make a global rating. In three selection problems the Q by Q Interview resulted in high interrater reliability; in three other selection situations, reliability was low. An analysis of the raters' decision making indicated that when applicant responses were similar, range restriction produced spuriously low reliability coefficients.
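The range-restriction effect can be demonstrated with a small simulation: two raters who share a common view of applicant quality correlate strongly across a heterogeneous applicant pool, but the correlation shrinks when all applicants respond similarly. A hedged sketch with simulated ratings, not Asher's data:

```python
import numpy as np

rng = np.random.default_rng(2)

def rater_correlation(applicant_sd: float, n: int = 2000) -> float:
    """Correlation between two raters when applicant quality varies by applicant_sd."""
    quality = rng.normal(scale=applicant_sd, size=n)
    rater1 = quality + rng.normal(scale=1.0, size=n)
    rater2 = quality + rng.normal(scale=1.0, size=n)
    return np.corrcoef(rater1, rater2)[0, 1]

print(rater_correlation(applicant_sd=2.0))  # heterogeneous applicants: high reliability
print(rater_correlation(applicant_sd=0.3))  # similar applicants: restricted range, low reliability
```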

