Abstract

Objective: To compare different reliability coefficients (exact agreement, and variations of the kappa: generalised, Cohen's, and the prevalence-adjusted and bias-adjusted kappa (PABAK)) for four physiotherapists conducting visual assessments of scapulae.

Design: Inter-therapist reliability study.

Setting: Research laboratory.

Participants: 30 individuals with no history of neck or shoulder pain and no obvious significant postural abnormalities.

Main outcome measures: Ratings of scapular posture recorded in multiple biomechanical planes under four test conditions (at rest, and during three isometric conditions) by four physiotherapists.

Results: The magnitude of discrepancy between the two therapist pairs ranged from 0.04 to 0.76 for Cohen's kappa and from 0.00 to 0.86 for PABAK. In comparison, the generalised kappa provided a score between the two paired kappa coefficients. The differences between the mean generalised kappa and the mean Cohen's kappa (0.02), and between the mean generalised kappa and the mean PABAK (0.02), were negligible, but the magnitude of difference between the generalised kappa and the paired kappas within each plane and condition was substantial: 0.02 to 0.57 for Cohen's kappa and 0.02 to 0.63 for PABAK, respectively.

Conclusions: Calculating coefficients for therapist pairs alone may result in inconsistent findings. In contrast, the generalised kappa provided a coefficient close to the mean of the paired kappa coefficients. These findings support the assertion that the generalised kappa may better represent reliability between three or more raters, and that reliability studies calculating agreement between only two raters should be interpreted with caution. However, the generalised kappa may mask more extreme cases of agreement (or disagreement) that paired comparisons can reveal.
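The divergence between Cohen's kappa and PABAK reported above arises from how each estimates chance agreement. A minimal illustrative sketch (the rating labels and data below are hypothetical, not the study's) shows how a skewed prevalence of one category depresses Cohen's kappa while leaving PABAK untouched:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance agreement estimated from each rater's marginals."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

def pabak(rater1, rater2, k):
    """PABAK: chance agreement fixed at 1/k (k = number of rating categories),
    so the coefficient is unaffected by skewed prevalence or rater bias."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    return (k * p_o - 1) / (k - 1)

# Hypothetical two-rater data dominated by "neutral" ratings (not study data):
r1 = ["neutral"] * 9 + ["downward"]
r2 = ["neutral"] * 8 + ["downward"] * 2
print(round(cohens_kappa(r1, r2), 2))  # 0.62 -- depressed by skewed prevalence
print(round(pabak(r1, r2, k=3), 2))    # 0.85 -- depends only on raw agreement
```

With the same 90% raw agreement, the two coefficients differ by more than 0.2, the same kind of gap the study observed between coefficient choices.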

Highlights

  • Clinical decision making is often based on examination findings that are dependent on subjective nominal ratings of status, such as those used to evaluate posture alignment [1,2,3,4]

  • The prevalence of neutral ratings varied between the different postural planes (Table 2), with the highest for the horizontal plane (86% to 95%), followed by the vertical (83% to 86%), transverse (63% to 77%), scapular (32% to 83%), and sagittal (33% to 63%) planes

  • The highest mean generalised kappa was calculated from ratings in the sagittal plane (0.63 (0.57 to 0.78)), followed by the scapular (0.47 (0.32 to 0.57)), transverse (0.41 (0.38 to 0.44)), horizontal (0.33 (0.1 to 0.65)), and vertical (0.22 (0.08 to 0.3)) planes
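The per-plane generalised kappa in the highlights is, in one common formulation, Fleiss' kappa computed over all raters at once. A minimal sketch under that assumption (the category labels and four-rater data below are illustrative, not the study's):

```python
def fleiss_kappa(ratings, categories):
    """Generalised (Fleiss') kappa for m raters over n subjects.

    `ratings` holds one list per subject, containing one category label
    per rater. Chance agreement comes from the pooled label proportions."""
    n = len(ratings)
    m = len(ratings[0])
    counts = [[row.count(c) for c in categories] for row in ratings]
    # observed agreement: proportion of agreeing rater pairs, averaged over subjects
    p_o = sum((sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts) / n
    # chance agreement from pooled category proportions
    props = [sum(row[j] for row in counts) / (n * m) for j in range(len(categories))]
    p_e = sum(p * p for p in props)
    return (p_o - p_e) / (1 - p_e)

# Four hypothetical raters, three subjects (not study data):
ratings = [
    ["neutral", "neutral", "neutral", "neutral"],
    ["upward", "upward", "neutral", "upward"],
    ["downward", "downward", "downward", "neutral"],
]
print(round(fleiss_kappa(ratings, ["downward", "neutral", "upward"]), 2))  # 0.47
```

A single coefficient like this summarises all four raters at once, which is why it tracks the mean of the paired kappas while smoothing over the most extreme pairs.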


Introduction

Clinical decision making is often based on examination findings that depend on subjective nominal ratings of status, such as those used to evaluate posture alignment [1,2,3,4]. Investigating agreement between physiotherapists using these nominal clinical measures is challenging, both because of the subjective nature of the ratings and because of the limitations of the reliability coefficients used to express agreement within and between therapists [5,6]. One advantage of Cohen's kappa for examining agreement between raters is the ability to apply a weighting system that penalises larger disagreements more heavily than smaller ones: in ordinal data sets with three or more levels, a disparate disagreement lowers the kappa coefficient more than a near-miss does [7,9].
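The weighting idea can be sketched as a linearly weighted Cohen's kappa, where the penalty for a cell of the confusion matrix grows with its distance from the diagonal (the ordinal category names below are hypothetical, not the study's rating scheme):

```python
def weighted_kappa(rater1, rater2, categories):
    """Cohen's kappa with linear weights for ordinal categories: a
    two-level disagreement is penalised twice as much as a one-level
    near-miss. `categories` must be listed in ordinal order."""
    n, k = len(rater1), len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # observed cell proportions of the k x k confusion matrix
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[idx[a]][idx[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]                       # rater-1 marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater-2 marginals
    w = lambda i, j: abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

# Hypothetical ordinal ratings (not study data):
r1 = ["down", "neutral", "up", "up"]
r2 = ["down", "up", "up", "neutral"]
print(round(weighted_kappa(r1, r2, ["down", "neutral", "up"]), 2))  # 0.43
```

Both disagreements here are one-level near-misses; replacing either with a "down" versus "up" disagreement would drag the coefficient down further, which is exactly the behaviour the weighting is meant to capture.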
