Examiner Variability in Clinical Assessments: Do Examiner Pairings Influence Candidate Ratings?
Abstract

Background
The reliability of clinical assessments is known to vary considerably, and inter-examiner variability is a key contributor. This can produce significant differences in scores between comparable candidates, a serious challenge in medical education. An approach frequently adopted to mitigate this and improve reliability is to pair examiners and ask them to agree on a single score. Little is known, however, about what occurs when paired examiners interact to generate a score.

Methods
A fully crossed design was employed, with each participating examiner observing and scoring each candidate. A quasi-experimental research design used candidates' observed scores in a mock clinical assessment as the dependent variable. The independent variables were examiner numbers, demographics, and personality. Demographic and personality data were collected by questionnaire. A purposeful sample of medical doctors who examine in the Final Medical examination at our institution was recruited.

Results
Variability between scores given by examiner pairs (N=6) was less than the variability between scores given by individual examiners (N=12). Seventy-five percent of examiners (N=9) scored below average for neuroticism, and 75% scored high or very high for extroversion. Two-thirds scored high or very high for conscientiousness. The higher an examiner's personality score for extroversion, the smaller the change in his/her score when paired with a co-examiner, possibly reflecting a more dominant role in the process of reaching a consensus score.

Conclusions
While the variability between scores given by examiner pairs (N=6) was less than that between individual examiners (N=12), the reliability statistics for both assessments were comparable. Using paired examiners resulted in a more accurate and robust score than simply averaging two independent examiners' scores.
The finding that more extroverted examiners changed their scores less when paired with a co-examiner, possibly reflecting a more dominant role in reaching a consensus score, could have implications for the organisation and administration of clinical assessments. Further studies with larger numbers of participants might establish whether personality testing could be used to help pair examiners and reduce examiner variability.