Abstract

Background
Inter-rater reliability (IRR) is mainly assessed based on only two reviewers of unknown expertise. The aim of this paper is to examine differences in the IRR of the Assessment of Multiple Systematic Reviews (AMSTAR) and R(evised)-AMSTAR depending on the pair of reviewers.

Methods
Five reviewers independently applied AMSTAR and R-AMSTAR to 16 systematic reviews (eight Cochrane reviews and eight non-Cochrane reviews) from the field of occupational health. Responses were dichotomized and reliability measures were calculated by applying Holsti’s method (r) and Cohen’s kappa (κ) to all potential pairs of reviewers. Given that five reviewers participated in the study, there were ten possible pairs of reviewers.

Results
Depending on the pair of reviewers, inter-rater reliability for AMSTAR varied between r = 0.82 and r = 0.98 (median r = 0.88) using Holsti’s method and between κ = 0.41 and κ = 0.69 (median κ = 0.52) using Cohen’s kappa; for R-AMSTAR it varied between r = 0.77 and r = 0.89 (median r = 0.82) and between κ = 0.32 and κ = 0.67 (median κ = 0.45). The same pair of reviewers yielded the highest IRR for both instruments. Pairwise Cohen’s kappa reliability measures showed a moderate correlation between AMSTAR and R-AMSTAR (Spearman’s ρ = 0.50). The mean inter-rater reliability for AMSTAR was highest for item 1 (κ = 1.00) and item 5 (κ = 0.78), while the lowest values were found for items 3, 8, 9 and 11, which showed only fair agreement.

Conclusions
Inter-rater reliability varies widely depending on the pair of reviewers. There may be some shortcomings associated with conducting reliability studies with only two reviewers. Further studies should include additional reviewers and should probably also take account of their level of expertise.
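The two reliability measures named above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' analysis code: for two raters coding the same dichotomized items, Holsti’s method reduces to the proportion of items on which they agree, while Cohen’s kappa corrects that agreement for chance using each rater’s marginal proportions. The ratings below are hypothetical, invented purely to show the pairwise computation over all ten pairs of five reviewers.

```python
from itertools import combinations

def holsti_r(a, b):
    """Holsti's method for two raters coding the same items:
    the proportion of items on which they agree."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two raters with dichotomous (0/1) codes:
    chance-corrected agreement (po - pe) / (1 - pe)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement from each rater's marginal proportions
    p1a, p1b = sum(a) / n, sum(b) / n
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)
    return (po - pe) / (1 - pe)

# hypothetical dichotomized ratings (1 = "yes") for five reviewers
ratings = {
    "R1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "R2": [1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    "R3": [1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
    "R4": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0],
    "R5": [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
}

# with five reviewers there are C(5, 2) = 10 possible pairs
for (na, a), (nb, b) in combinations(ratings.items(), 2):
    print(f"{na}-{nb}: r = {holsti_r(a, b):.2f}, kappa = {cohens_kappa(a, b):.2f}")
```

As the study's results illustrate, κ is typically well below r for the same pair, because kappa discounts the agreement that two raters would reach by chance alone given how often each says "yes".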

Highlights

  • Inter-rater reliability (IRR) is mainly assessed based on only two reviewers of unknown expertise

  • The results of the telephone conference were collected on an instrument and item basis and were made available to all reviewers once all reviewers agreed on all amendments

  • A statistically significant result was that Cochrane reviews (CRs) obtained more “yes” items than non-Cochrane reviews (nCRs) (AMSTAR: 9 vs. 5.5, p < 0.001; R-AMSTAR: 39 vs. 32.5, p < 0.001)



Introduction

Measurement can be described as the process of systematically assigning numbers or labels to objects and their properties. For example, measurement questions can range from symptoms, physical examinations, laboratory tests and imaging to self-report questionnaires. The measurements obtained can be used as a basis for subsequent decisions (e.g. regarding treatments). It is therefore important that measurements are reliable and valid, aspects which are referred to as measurement properties. If they are not, there is a serious risk of imprecise or biased results that could lead to incorrect decisions or conclusions.

