Abstract

Cohen's kappa (κ) is often recommended for nominal data as a measure of inter-rater (inter-coder) agreement or reliability. In this paper we ask which term is appropriate in genre analysis, what statistical measures are valid to measure it, and how much the choice of units affects the values obtained. We find that although both agreement and reliability may be of interest, only agreement can be measured with nominal data. Moreover, while kappa may be appropriate for macrostructure or corpus analysis, it is inappropriate for move or component analysis, due to the requirement of κ that the units be predetermined, fixed, and independent. κ further assumes that all disagreements in category assignment are equally likely, which may not be true. We also describe other measures, including correlation, chi-square, and percent agreement, and demonstrate that despite its limitations, percent agreement is the only valid measure in many situations. Finally, we demonstrate why the choice of unit has a large effect on the value calculated. These findings also apply to other studies in applied linguistics using nominal data. We conclude that the methodology used needs to be clearly explained to ensure that the requirements have been met, as with any other statistical test.
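To make the contrast between the two statistics concrete, the sketch below computes both percent (raw) agreement and Cohen's kappa for two raters coding the same units. The move labels (`M1`–`M3`) and the rating vectors are hypothetical illustrations, not data from the paper; the kappa value is taken from scikit-learn's `cohen_kappa_score`, which implements the standard chance-corrected formula κ = (p_o − p_e)/(1 − p_e).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical move codes assigned by two raters to the same ten units
rater_a = ["M1", "M1", "M2", "M3", "M2", "M1", "M3", "M2", "M1", "M2"]
rater_b = ["M1", "M2", "M2", "M3", "M2", "M1", "M3", "M1", "M1", "M2"]

# Percent (raw) agreement: proportion of units on which the raters match
percent_agreement = np.mean([a == b for a, b in zip(rater_a, rater_b)])

# Cohen's kappa: observed agreement corrected for chance agreement,
# kappa = (p_o - p_e) / (1 - p_e)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"percent agreement = {percent_agreement:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
```

Note that both statistics presuppose that the units being coded are the same for both raters and fixed in advance; as the abstract argues, this assumption fails when the raters themselves segment the text into moves, which is why percent agreement over an agreed unitisation is often the only defensible figure.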
