Abstract

In binary classification tasks, Cohen's kappa is often used as a quality measure for data annotations, which is inconsistent with its original purpose as an inter-annotator consistency measure. The analytic relationship between kappa and commonly used classification metrics (e.g., sensitivity and specificity) is nonlinear, and is therefore difficult to apply when interpreting the classification performance of annotations merely from the kappa value. In this study, based on an annotation generation model, we derive a simplified, linear relationship among Cohen's kappa, sensitivity, and specificity by using a first-order Taylor approximation. This relationship is further simplified by relating it to Youden's J statistic, a performance metric for binary classification tasks. We analyze the linear coefficients in the simplified relationship and the approximation error, and conduct a linear regression analysis to assess the relationship on a synthetic dataset where the ground truth is known. The results show that the approximation error in the simplified relationship is negligible when no major bias and prevalence issues exist. Furthermore, the relationship between kappa and Youden's J is validated on an annotation dataset from seven graders in a diabetic retinopathy screening study. The discrepancy between kappa and Youden's J is demonstrated to be an effective measure for annotator assessment when no ground truth is available.
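To ground the kappa-J connection, the standard definitions and one special-case derivation are sketched below in LaTeX. The notation (p_o, p_e, Se, Sp) is ours, and the setting assumed here (labels scored against a known ground truth at prevalence 1/2) is an illustrative simplification, not the paper's full inter-annotator model:

    \kappa = \frac{p_o - p_e}{1 - p_e}, \qquad J = \mathrm{Se} + \mathrm{Sp} - 1

With prevalence \pi = 1/2, the observed agreement and the annotator's positive rate are

    p_o = \pi\,\mathrm{Se} + (1-\pi)\,\mathrm{Sp} = \tfrac{1}{2}(\mathrm{Se} + \mathrm{Sp}), \qquad
    q = \pi\,\mathrm{Se} + (1-\pi)(1-\mathrm{Sp}),

so the chance agreement is p_e = \pi q + (1-\pi)(1-q) = 1/2, and

    \kappa = \frac{(\mathrm{Se}+\mathrm{Sp})/2 - 1/2}{1 - 1/2} = \mathrm{Se} + \mathrm{Sp} - 1 = J.

In the paper's setting, kappa is computed between annotators rather than against ground truth, so this equality holds only approximately, with the deviation driven by annotator bias and prevalence effects.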

Highlights

  • In medical imaging, almost all state-of-the-art methods for lesion detection and disease diagnosis tasks are developed by applying supervised learning with a binary classification formulation [1]–[4]

  • Pelletier et al. [10] studied the effect of annotation noise on classification performance in land cover mapping from satellite time-series images; it was concluded that classifier performance can be adversely affected when noise levels exceed 25%–30% (a minimal simulation sketch appears after this list)

  • Because of this noted inconsistency in the use of kappa, there has been great interest in investigating the relationship between Cohen's kappa and the performance metrics commonly used in classification tasks [22]–[25]
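To illustrate the kind of label-noise experiment summarized above (this is not Pelletier et al.'s actual pipeline; the dataset, model, and noise levels are illustrative assumptions), the following sketch flips a fraction of training labels and measures the resulting test accuracy:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic binary data as a stand-in for real image-derived features.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    rng = np.random.default_rng(0)
    for noise in [0.0, 0.1, 0.2, 0.3, 0.4]:
        # Flip a `noise` fraction of the training labels.
        y_noisy = y_tr.copy()
        flip = rng.random(len(y_noisy)) < noise
        y_noisy[flip] = 1 - y_noisy[flip]
        acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
        print(f"label noise {noise:.0%}: test accuracy {acc:.3f}")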


Summary

INTRODUCTION

Almost all state-of-the-art methods for lesion detection and disease diagnosis tasks are developed by applying supervised learning with a binary classification formulation [1]–[4]. It is noted that Cohen's kappa is only intended to evaluate how often annotators agree with each other; it does not directly measure the quality (i.e., the accuracy) of annotations for a classification task, where sensitivity and specificity are of the most concern. Because of this noted inconsistency in the use of kappa, there has been great interest in investigating the relationship between Cohen's kappa and the performance metrics commonly used in classification tasks (i.e., sensitivity and specificity) [22]–[25]. The relationship between kappa and Youden's J is validated on a real-life dataset collected from a diabetic retinopathy (DR) screening study, wherein the discrepancy between kappa and Youden's J is applied to annotator assessment, as sketched below.
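A minimal sketch of this style of annotator assessment, assuming binary labels from one annotator and a reference standard; the data here are hypothetical and this is not the paper's code:

    import numpy as np

    def binary_metrics(y_true, y_pred):
        """Sensitivity, specificity, Youden's J, and Cohen's kappa for binary labels."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        se = tp / (tp + fn)          # sensitivity
        sp = tn / (tn + fp)          # specificity
        j = se + sp - 1              # Youden's J
        n = tp + tn + fp + fn
        p_o = (tp + tn) / n          # observed agreement
        # Chance agreement from the marginal positive/negative rates of both raters.
        p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
        kappa = (p_o - p_e) / (1 - p_e)
        return se, sp, j, kappa

    # Hypothetical annotator: agrees with the reference ~90% of the time.
    rng = np.random.default_rng(0)
    y_ref = rng.integers(0, 2, size=1000)
    y_ann = np.where(rng.random(1000) < 0.9, y_ref, 1 - y_ref)
    se, sp, j, kappa = binary_metrics(y_ref, y_ann)
    print(f"Se={se:.3f} Sp={sp:.3f} J={j:.3f} kappa={kappa:.3f} kappa-J={kappa - j:.3f}")

With balanced prevalence, as here, kappa and J nearly coincide, consistent with the derivation above; a persistent gap between them flags bias or prevalence effects in the annotations.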

ANNOTATION GENERATION MODEL
COHEN’S KAPPA COEFFICIENT
KAPPA APPROXIMATION
ERROR ANALYSIS IN KAPPA APPROXIMATION
VALIDATION EXPERIMENTS
THE EFFECT OF BIAS ON THE RELATIONSHIP
THE EFFECT OF PREVALENCE ON THE RELATIONSHIP
APPLICATION EXAMPLE
RELATIONSHIP EVALUATION
ANNOTATOR QUALITY EVALUATION
CONCLUSION
