Dear Editor, We thank Loef and colleagues for their interest in our article on the inter-rater reliability and usability of the revised Cochrane risk of bias tool for randomized trials (RoB 2) [1]. Their analysis and comments focused mainly on the choice of the most appropriate measure of inter-rater agreement. We acknowledge the paradoxes that may occur with the kappa statistic, and we referred to them as a possible limitation of our results: “The expected agreement can exceed the observed agreement and then generate kappa values lower than 0. This is why in some cases Fleiss’ kappa may return low values even when agreement is actually high (Fleiss’ kappa paradox). Our results might be further affected by this paradox.” [1]. We agree that Gwet's AC1/2 statistics may overcome these paradoxes and be a valuable option. The two approaches differ in how the expected agreement by chance (“pe”) is calculated, and this can produce an overestimation of the agreement with Gwet's AC1/2 statistics. As expected, the reanalysis of the data using this statistic led to a higher agreement, which would have been classified as moderate. We believe the main reason for this difference lies in the sample size of our analysis (70 measurements). The two approaches are likely to estimate similar agreement values when a larger number of measurements is made, while results may be highly variable in small samples [2,3]. We compared the number of studies where all the raters gave the same judgement with the inter-rater agreement values (Table 1) and found good consistency. For instance, the raters gave the same overall judgement in 12 of 70 studies (17%), which is consistent with the slight inter-rater agreement (IRR = 0.16) measured with Fleiss’ kappa [1].
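As a minimal numeric sketch (illustrative only, not the code used in our analysis), the example below computes Fleiss’ kappa and Gwet’s AC1 from the same hypothetical counts matrix. Both statistics share the observed pairwise agreement (Po) and differ only in the chance-agreement term (pe), which is why a skewed distribution of judgements can push Fleiss’ kappa towards zero or below while Gwet’s AC1 remains high.

```python
# Illustrative sketch: Fleiss' kappa and Gwet's AC1 from the same ratings.
# The counts matrix below is hypothetical, not data from our study.
import numpy as np

def agreement_stats(counts):
    """counts: N x k matrix; counts[i, j] = number of raters assigning
    subject i to category j. Every row must sum to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_subjects, n_cats = counts.shape
    r = counts.sum(axis=1)[0]  # raters per subject (assumed constant)

    # Observed pairwise agreement Po, identical for both statistics
    p_o = (((counts ** 2).sum(axis=1) - r) / (r * (r - 1))).mean()

    # Marginal category proportions
    p_j = counts.sum(axis=0) / (n_subjects * r)

    # Chance agreement pe: this is where the two statistics diverge
    pe_fleiss = (p_j ** 2).sum()
    pe_gwet = (p_j * (1 - p_j)).sum() / (n_cats - 1)

    kappa = (p_o - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (p_o - pe_gwet) / (1 - pe_gwet)
    return p_o, kappa, ac1

# Hypothetical ratings: 10 studies, 3 raters, 3 RoB categories
# (low / some concerns / high), skewed toward "some concerns".
example = [[0, 3, 0], [0, 3, 0], [1, 2, 0], [0, 3, 0], [0, 2, 1],
           [0, 3, 0], [0, 3, 0], [1, 2, 0], [0, 3, 0], [0, 2, 1]]
# With these data: Po ~ 0.73, Fleiss' kappa ~ -0.11, Gwet's AC1 ~ 0.70.
print(agreement_stats(example))
```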
This seems to apply also to the single domains.

Table 1. Inter-rater agreement for the single domains and the overall judgement, compared with the percentage of studies where all raters gave the same RoB judgement.

RoB 2 domain      % same RoB judgement   IRR (95% CI)
1                 47                     0.45 (0.37 to 0.53)
2                 20
  Assignment                             0.04 (-0.06 to 0.14)
  Adhering                               0.21 (0.11 to 0.31)
3                 58                     0.22 (0.14 to 0.30)
4                 50                     0.27 (0.19 to 0.35)
5                 31                     0.30 (0.22 to 0.38)
Overall           17                     0.16 (0.08 to 0.24)

Simulation studies have focused on the comparison between Gwet's AC1/2 statistics and Cohen's kappa for binary data. To our knowledge, robust comparisons between Gwet's AC1/2 and Fleiss’ kappa are lacking, which makes any strong conclusion about the best approach difficult. Both approaches rely on several assumptions about the experience of the raters, and violation of these assumptions could contribute to paradoxical results. In practice, it is almost inevitable that raters will vary in their experience and expertise, and this was the case in our study. We believed Fleiss’ kappa was appropriate for our analysis given that our raters had different competences, expertise, and training; indeed, one of the kappa statistic's assumptions is that the raters’ judgements are independent. Regarding the second issue, we applied the Landis and Koch classification conservatively, considering the point estimate only. However, cIMP may be a good alternative approach and we will consider it in future analyses.
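For transparency, the short sketch below shows the conservative mapping described above: each IRR point estimate from Table 1 is classified against the Landis and Koch benchmarks, ignoring the confidence interval. The helper function and the dictionary of point estimates are illustrative, not part of our original analysis code.

```python
# Sketch of the conservative labelling: classify each IRR point estimate
# from Table 1 using the Landis and Koch (1977) benchmarks.
def landis_koch(irr):
    if irr < 0.00:
        return "poor"
    if irr <= 0.20:
        return "slight"
    if irr <= 0.40:
        return "fair"
    if irr <= 0.60:
        return "moderate"
    if irr <= 0.80:
        return "substantial"
    return "almost perfect"

point_estimates = {"Domain 1": 0.45, "Domain 2 (assignment)": 0.04,
                   "Domain 2 (adhering)": 0.21, "Domain 3": 0.22,
                   "Domain 4": 0.27, "Domain 5": 0.30, "Overall": 0.16}
for domain, irr in point_estimates.items():
    print(f"{domain}: {irr:.2f} -> {landis_koch(irr)}")
# e.g. the overall IRR of 0.16 is classified as "slight",
# and domain 1 (0.45) as "moderate".
```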

References
1. Minozzi S, Cinquini M, Gianola S, Gonzalez-Lorenzo M, Banzi R. The revised Cochrane risk of bias tool for randomized trials (RoB 2) showed low interrater reliability and challenges in its application. J Clin Epidemiol 2020;126:37-44.
2. Ohyama T. Statistical inference of Gwet's AC1 coefficient for multiple raters and binary outcomes. Commun Stat Theory Methods 2021;50:3564-3572. https://doi.org/10.1080/03610926.2019.1708397
3. Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol 2013;13:61. https://doi.org/10.1186/1471-2288-13-61