Abstract

We read with great interest the publication by Ruiz de Gauna et al.1. We agree that estimating the agreement between different observers is crucial if a given diagnostic method is to be introduced into clinical practice. However, we have a question and some suggestions regarding the statistical analysis, because it can affect the interpretation of the data.

Cohen's kappa coefficient, used in many scientific fields, is a measure of interobserver or interdevice agreement for qualitative (categorical) items that takes into account the possibility that the agreement may have occurred by chance; it is defined as kappa = (po - pe)/(1 - pe), where po is the observed proportion of agreement and pe is the agreement expected by chance, computed from the observers' marginal distributions. However, its interpretation has been severely criticized. It is associated with a well-known problem, referred to by Feinstein and Cicchetti2 as ‘the first paradox of kappa’. Ordinarily, we expect the value of kappa to be high when both observers assign the same category with high probability, and low when their ratings are spread evenly across the categories. However, the more imbalanced the marginal distributions of the categories, the higher the chance-expected agreement pe becomes and, paradoxically, the more the magnitude of kappa decreases, despite high observed agreement. Such imbalanced marginal distributions often arise when the sample data are obtained from a population with a very low prevalence rate3.

In the study of Ruiz de Gauna et al., the observers evaluated the presence or absence of five benign features (B1–B5) and five malignant features (M1–M5) during real-time ultrasound examination in women with a persistent adnexal mass; for each feature, the authors evaluated the agreement between observers by calculating the standard kappa index (their Table 3)1. Agreement was classified from ‘poor’ to ‘very good’ according to the kappa values; it was described as very good for B1, B4 and M4; good for B2, M3 and M5; moderate for B5, M1 and M2; and fair for B3.

The prevalence of observation of the 10 features in their Table 3, ranging from 4.8% to 28.6%, shows imbalanced marginal distributions. It is therefore expected that features with the same percentage (observed) agreement will have different kappa indices, and that features with a lower prevalence will have a lower kappa index. For example, B1 and B2, with the same percentage agreement (95.2%), were classified differently: B2 (4.8% prevalence) was categorized as ‘good’ and B1 (17.8% prevalence) as ‘very good’, based on kappa values of 0.64 and 0.89, respectively. Similarly, B5 and M5, with the same percentage agreement (80.9%), were also categorized differently, as ‘moderate’ and ‘good’, respectively. The features with low prevalence rates had low kappa values despite the high percentage agreement. For this reason, kappa is considered an overly conservative measure of agreement. It seems more appropriate to use the interclass kappa4 or the first-order agreement coefficient (AC1)5 as the measure of agreement when analysing data with severely imbalanced marginal distributions.

H. S. Ko†, N. Kim† and Y.-G. Park*‡
†Department of Obstetrics and Gynecology, Seoul St Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea; ‡Department of Biostatistics, College of Medicine, The Catholic University of Korea, 222, Banpo-daero, Seocho-gu, Seoul 137-701, Republic of Korea
*Correspondence (e-mail: [email protected])
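To make the paradox discussed above concrete, the following minimal sketch uses two purely hypothetical 2 x 2 rating tables (not the counts reported by Ruiz de Gauna et al.) that have the same observed agreement (90%) but different marginal balance. It computes the observed agreement, Cohen's kappa and Gwet's first-order agreement coefficient (AC1) for each table.

```python
# Hypothetical 2x2 rater-agreement tables with the SAME observed agreement (90%)
# but different marginal balance. Rows: observer 1 (feature present / absent);
# columns: observer 2 (present / absent). Counts are illustrative only.

def agreement_stats(table):
    """Return observed agreement, Cohen's kappa and Gwet's AC1 for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    po = (a + d) / n                       # observed proportion of agreement
    p1_obs1 = (a + b) / n                  # observer 1 marginal, 'present'
    p1_obs2 = (a + c) / n                  # observer 2 marginal, 'present'
    # Cohen's kappa: chance agreement from the product of the marginals
    pe_kappa = p1_obs1 * p1_obs2 + (1 - p1_obs1) * (1 - p1_obs2)
    kappa = (po - pe_kappa) / (1 - pe_kappa)
    # Gwet's AC1 (first-order agreement coefficient) for two categories
    pi = (p1_obs1 + p1_obs2) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return po, kappa, ac1

balanced   = [[8, 1], [1, 10]]   # feature seen in roughly 45% of cases
imbalanced = [[1, 1], [1, 17]]   # feature seen in roughly 10% of cases

for name, tab in [("balanced", balanced), ("imbalanced", imbalanced)]:
    po, kappa, ac1 = agreement_stats(tab)
    print(f"{name:10s}  agreement={po:.2f}  kappa={kappa:.2f}  AC1={ac1:.2f}")

# Expected output:
# balanced    agreement=0.90  kappa=0.80  AC1=0.80
# imbalanced  agreement=0.90  kappa=0.44  AC1=0.88
```

Under these hypothetical counts, kappa falls from about 0.80 to about 0.44 as the marginals become imbalanced, even though the observed agreement is unchanged, whereas AC1 remains close to the observed agreement; this mirrors the behaviour described above for the low-prevalence features.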
