Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling

IC McManus, M Thompson, J Mollon

doi:10.1186/1472-6920-6-42

Open Access

Journal: BMC Medical Education
Publication Date: Aug 18, 2006
Citations: 150
License type: CC BY 2.0

Affiliation: University College London

Abstract

Background: A potential problem of clinical examinations is known as the hawk-dove problem, in which some examiners are more stringent and require a higher performance than other, more lenient examiners. Although the problem has been known qualitatively for at least a century, we know of no previous statistical estimation of the size of the effect in a large-scale, high-stakes examination. Here we use FACETS to carry out a multi-facet Rasch modelling of the paired judgements made by examiners in the clinical examination (PACES) of MRCP(UK), where identical candidates were assessed in identical situations, allowing calculation of examiner stringency.

Methods: Data were analysed from the first nine diets of PACES, which were taken between June 2001 and March 2004 by 10,145 candidates. Each candidate was assessed by two examiners on each of seven separate tasks, and the candidates were assessed by a total of 1,259 examiners, resulting in a total of 142,030 marks. Examiner demographics were described in terms of age, sex, ethnicity, and total number of candidates examined.

Results: FACETS suggested that about 87% of main effect variance was due to candidate differences, 1% due to station differences, and 12% due to differences between examiners in leniency-stringency. Multiple regression suggested that greater examiner stringency was associated with greater examiner experience and with being from an ethnic minority. Male and female examiners showed no overall difference in stringency. Examination scores were adjusted for examiner stringency, and at the present pass mark the outcome for 95.9% of candidates would be unchanged using adjusted marks, whereas 2.6% of candidates would have passed despite failing on the basis of raw marks, and 1.5% would have failed despite passing on the basis of raw marks.

Conclusion: Examiners do differ in their leniency or stringency, and the effect can be estimated using Rasch modelling. The reasons for the differences are not clear, but there are some demographic correlates, and the effects appear to be reliable across time. Account can be taken of the differences, either by adjusting marks or, perhaps more effectively and more justifiably, by pairing high- and low-stringency examiners so that raw marks can be used in the determination of pass and fail.
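The mark adjustment described in the Results can be illustrated with a minimal sketch. It assumes that each examiner's stringency has already been estimated (for example by a Rasch analysis) and converted into raw-mark units; the function names, the mark scale, the pass mark and the three-candidate data set below are all hypothetical and chosen only for illustration, not taken from the paper. Each adjusted total adds back the stringency of the examiner who awarded each mark, and candidates are then counted by whether their pass/fail outcome changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One mark awarded by one examiner to one candidate (hypothetical scale)."""
    candidate: str
    examiner: str
    score: float


def adjusted_totals(marks, stringency):
    """Return (raw, adjusted) total marks per candidate.

    Assumption: `stringency` maps examiner -> estimated stringency in raw-mark
    units, positive for hawks (who mark below average) and negative for doves.
    Adding it back offsets the examiner effect on each mark. Examiners missing
    from the map are treated as having zero stringency.
    """
    raw, adjusted = {}, {}
    for m in marks:
        raw[m.candidate] = raw.get(m.candidate, 0.0) + m.score
        adjusted[m.candidate] = (
            adjusted.get(m.candidate, 0.0) + m.score + stringency.get(m.examiner, 0.0)
        )
    return raw, adjusted


def reclassification(raw, adjusted, pass_mark):
    """Percentage of candidates whose outcome is unchanged, newly passes, or newly fails."""
    n = len(raw)
    unchanged = sum((raw[c] >= pass_mark) == (adjusted[c] >= pass_mark) for c in raw)
    newly_pass = sum(raw[c] < pass_mark <= adjusted[c] for c in raw)
    newly_fail = sum(adjusted[c] < pass_mark <= raw[c] for c in raw)
    return {
        "unchanged": 100.0 * unchanged / n,
        "newly_pass": 100.0 * newly_pass / n,
        "newly_fail": 100.0 * newly_fail / n,
    }


if __name__ == "__main__":
    # Hypothetical toy data: one hawk and one dove, two marks per candidate.
    stringency = {"hawk": 1.0, "dove": -1.0}
    marks = [
        Mark("A", "hawk", 4), Mark("A", "hawk", 4),  # seen only by the hawk
        Mark("B", "dove", 5), Mark("B", "dove", 5),  # seen only by the dove
        Mark("C", "hawk", 5), Mark("C", "dove", 5),  # one of each
    ]
    raw, adjusted = adjusted_totals(marks, stringency)
    print(raw, adjusted)
    print(reclassification(raw, adjusted, pass_mark=9))
```

With these toy marks, candidate A (seen only by the hawk) passes after adjustment, B (seen only by the dove) fails, and C, who met one of each, is unaffected. That last case is the intuition behind pairing high- and low-stringency examiners: the two examiner effects largely cancel, so raw marks become fairer without any adjustment.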

Highlights

A potential problem of clinical examinations is known as the hawk-dove problem, in which some examiners are more stringent and require a higher performance than other, more lenient examiners
The first nine diets of the Practical Assessment of Clinical Examination Skills (PACES) examination were taken between June 2001 and March 2004: two diets in 2001, three in each of 2002 and 2003, and one in 2004
This paper describes an analysis of 10,145 candidates taking the PACES examination over nine diets, when they were examined by a total of 1,259 examiners who awarded a total of 142,030 marks
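The total number of marks is consistent with a complete design in which every candidate received a mark from each of two examiners at every one of the seven stations described in the abstract:

```latex
\underbrace{10{,}145}_{\text{candidates}} \times \underbrace{7}_{\text{stations}} \times \underbrace{2}_{\text{examiners per station}} = 142{,}030 \ \text{marks}
```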

Summary

Introduction

A potential problem of clinical examinations is known as the hawk-dove problem, in which some examiners are more stringent and require a higher performance than other, more lenient examiners. Although the problem has been known qualitatively for at least a century, we know of no previous statistical estimation of the size of the effect in a large-scale, high-stakes examination. A potential vulnerability of any clinical examination is that examiners differ in their relative leniency or stringency. This is known as the 'hawk-dove' effect: hawks tend to fail most candidates because their standards are very high, whereas doves tend to pass most candidates. The problem of hawks and doves is easy enough to describe, but finding an effective statistical technique for assessing it is far from straightforward.
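For reference, the general many-facet rating-scale model that FACETS implements (Linacre's formulation) can be written as below. This is a sketch of the standard model, not necessarily the exact parameterisation used in this study, and the symbols are our own labels: $B_n$ is the ability of candidate $n$, $D_i$ the difficulty of station $i$, $C_j$ the stringency of examiner $j$, $F_k$ the difficulty of moving from mark category $k-1$ to category $k$, and $P_{nijk}$ the probability that examiner $j$ awards candidate $n$ a mark in category $k$ on station $i$.

```latex
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```

Because all facets sit on the same logit scale, a hawk (large $C_j$) lowers the log-odds of a higher mark by the same amount for every candidate. Provided examiners overlap across candidates and stations, this is what allows examiner stringency to be estimated separately from candidate ability.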

Methods

Results

Conclusion