Abstract

Background

Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet’s AC1 and compared the results.

Methods

This study was carried out across 67 patients (56% males) aged 18 to 67, with a mean ± SD age of 44.13 ± 12.68 years. Nine raters (seven psychiatrists, a psychiatry resident and a social worker) participated as interviewers, for either the first or the second interviews, which were held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen’s Kappa and Gwet’s AC1 were used, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence.

Results

Gwet’s AC1 showed higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen’s Kappa ranged from 0 to 1.00. Cohen’s Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet’s AC1 values changed little with prevalence and remained close to the percentage of agreement. For example, the Schizoid sample yielded a mean Cohen’s Kappa of .726 and a Gwet’s AC1 of .853, values that fall within different levels of agreement according to the criteria developed by Landis and Koch, and by Altman and Fleiss.

Conclusions

Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet’s AC1 provided a more stable inter-rater reliability coefficient than Cohen’s Kappa. It was also less affected by prevalence and marginal probability than Cohen’s Kappa, and should therefore be considered for use in inter-rater reliability analysis.
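For reference, the "different formulae" contrasted in the Conclusions are conventionally written as follows (standard definitions, not reproduced from the paper itself). With $p_o$ the observed proportion of agreement, $p_{q\cdot}$ and $p_{\cdot q}$ the two raters' marginal proportions for category $q$, and $\hat{\pi}_q$ the mean of those marginals:

$$ \kappa = \frac{p_o - p_e^{(\kappa)}}{1 - p_e^{(\kappa)}}, \qquad p_e^{(\kappa)} = \sum_{q=1}^{Q} p_{q\cdot}\, p_{\cdot q} $$

$$ AC_1 = \frac{p_o - p_e^{(\gamma)}}{1 - p_e^{(\gamma)}}, \qquad p_e^{(\gamma)} = \frac{1}{Q-1} \sum_{q=1}^{Q} \hat{\pi}_q \bigl(1 - \hat{\pi}_q\bigr), \qquad \hat{\pi}_q = \frac{p_{q\cdot} + p_{\cdot q}}{2} $$

Because Kappa's chance term is built from the product of the marginals, it grows toward 1 when one category dominates, which destabilizes the coefficient; AC1's chance term instead shrinks as prevalence becomes extreme, so AC1 stays closer to the percentage of agreement.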

Highlights

  • Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; there are well documented statistical problems associated with the measure

  • The most common disagreement among the 4 pairs of raters was in relation to Schizoid and Passive-Aggressive personality disorder (PD) (3 out of the 4 pairs), while the second most common was in relation to Dependent and Obsessive-Compulsive PD

  • The effect of trait prevalence on inter-rater reliability was examined (Tables 3, 4 and 5); trait prevalence here was calculated as the number of positive cases, as judged by both raters, expressed as a percentage of the total number of cases (see the formula below this list)
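One compact reading of that calculation (notation mine, not the authors'): if $n_{++}$ is the number of cases judged positive by both raters and $N$ is the total number of cases, then

$$ \text{trait prevalence} = \frac{n_{++}}{N} \times 100\% $$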



Introduction

Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well documented statistical problems associated with the measure. The Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID-II) [1] is one of the standard tools used to diagnose personality disorders. Because this assessment results in dichotomous outcomes, Cohen’s Kappa [2,3] is commonly used to assess the reliability of raters. However, Di Eugenio and Glass [8] stated that κ is affected by the skewed distributions of categories (the prevalence problem) and by the degree to which coders disagree (the bias problem), and that an adjusted kappa does not repair either problem and seems to make the second one worse.
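To make the prevalence problem concrete, here is a minimal sketch (hypothetical counts, not data from this study) that computes Cohen’s Kappa and Gwet’s AC1 for two dichotomous 2 × 2 tables with identical 90% raw agreement; Kappa drops sharply once positive cases dominate the margins, while AC1 stays close to the percentage of agreement.

```python
# Hypothetical illustration of the prevalence problem (not data from this study).
# Counts for two raters on a dichotomous diagnosis:
#   a = both positive, b = rater 1 only positive, c = rater 2 only positive, d = both negative.

def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement from the marginals
    return (po - pe) / (1 - pe)

def gwet_ac1(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pi = ((a + b) / n + (a + c) / n) / 2                  # mean proportion classified positive
    pe = 2 * pi * (1 - pi)                                # Gwet's chance-agreement term (2 categories)
    return (po - pe) / (1 - pe)

balanced = (45, 5, 5, 45)   # positives and negatives roughly even; 90% raw agreement
skewed   = (85, 5, 5, 5)    # high trait prevalence; still 90% raw agreement

print(cohen_kappa(*balanced), gwet_ac1(*balanced))  # ~0.80 and ~0.80
print(cohen_kappa(*skewed), gwet_ac1(*skewed))      # ~0.44 and ~0.88
```

The two functions differ only in how the chance-agreement term pe is formed, which is exactly the source of the divergence between the coefficients.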

