Abstract

Agreement between experts' ratings is an important prerequisite for an effective screening procedure. In clinical settings, large-scale studies are often conducted to compare the agreement of experts' ratings between new and existing medical tests, for example, digital versus film mammography. Challenges arise in these studies because many experts rate the same sample of patients undergoing two medical tests, leading to a complex correlation structure among the experts' ratings. Here, we propose a novel paired kappa measure to compare the agreement between the binary ratings of many experts across two medical tests. Existing approaches can accommodate only a small number of experts, rely heavily on Cohen's kappa and Scott's pi measures of agreement, and are thus prone to their drawbacks. The proposed kappa appropriately accounts for correlations between ratings due to patient characteristics, corrects for agreement due to chance, and is robust to disease prevalence and to other flaws inherent in the use of Cohen's kappa. It can be easily calculated in the software package R. In contrast to existing approaches, the proposed measure can flexibly incorporate large numbers of experts and patients by utilizing the generalized linear mixed models framework. It is intended to be used in population-based studies, increasing efficiency without increasing modeling complexity. Extensive simulation studies demonstrate low bias and excellent coverage probability of the proposed kappa under a broad range of conditions. Methods are applied to a recent nationwide breast cancer screening study comparing film mammography to digital mammography.
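
To illustrate the kind of model the abstract refers to, the sketch below fits a logistic generalized linear mixed model with crossed random effects for patients and experts in R. This is not the authors' implementation of the paired kappa; the simulated data, variable names, and the simple random-intercept structure are illustrative assumptions, shown only to indicate how such ratings data might enter the GLMM framework mentioned above.

    ## Minimal sketch, assuming crossed patient and expert random effects;
    ## the paired kappa itself would be computed from quantities derived
    ## from a fitted model such as this (details are in the paper, not here).
    library(lme4)

    set.seed(1)
    n_patients <- 200; n_experts <- 20
    dat <- expand.grid(patient = factor(1:n_patients),
                       expert  = factor(1:n_experts),
                       test    = factor(c("film", "digital")))

    ## Simulate binary ratings with patient- and expert-level heterogeneity
    b_pat <- rnorm(n_patients, sd = 1.5)
    b_exp <- rnorm(n_experts,  sd = 0.5)
    eta   <- -1 + 0.3 * (dat$test == "digital") +
             b_pat[dat$patient] + b_exp[dat$expert]
    dat$rating <- rbinom(nrow(dat), 1, plogis(eta))

    ## Every expert rates every patient under both tests (crossed design)
    fit <- glmer(rating ~ test + (1 | patient) + (1 | expert),
                 data = dat, family = binomial)
    summary(fit)
    VarCorr(fit)  # variance components on which a model-based agreement measure could build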
