Abstract

Background: GRADE was developed to address shortcomings of existing tools for rating the quality of a body of evidence. While much has been published about GRADE, there are few empirical and systematic evaluations.

Objective: To assess GRADE for systematic reviews (SRs) in terms of inter-rater agreement and to identify areas of uncertainty.

Design: Cross-sectional, descriptive study.

Methods: We applied GRADE to three SRs (n = 48, 66, and 75 studies, respectively) with 29 comparisons and 12 outcomes overall. Two reviewers independently graded the evidence for outcomes deemed clinically important a priori. Inter-rater reliability was assessed using kappa (κ) for the four main domains (risk of bias, consistency, directness, and precision) and for overall quality of evidence.

Results: For the first review, reliability was κ = 0.41 for risk of bias, 0.84 for consistency, 0.18 for precision, and 0.44 for overall quality. Kappa could not be calculated for directness because one rater assessed all items as direct; the assessors agreed in 41% of cases. For the second review, reliability was κ = 0.37 for consistency and 0.19 for precision. Kappa could not be calculated for the other items; the assessors agreed in 33% of cases for risk of bias, 100% for directness, and 58% for overall quality. For the third review, reliability was κ = 0.06 for risk of bias, 0.79 for consistency, 0.21 for precision, and 0.18 for overall quality. The assessors agreed in 100% of cases for directness. Precision created the most uncertainty, owing to difficulties in identifying the “optimal” information size and the “clinical decision threshold” and in making assessments when there was no meta-analysis. The risk of bias domain also created uncertainty, particularly for nonrandomized studies.

Conclusions: As researchers with varied levels of training and experience use GRADE, there is a risk of variability in interpretation and application. This study shows variable agreement across the GRADE domains, reflecting areas where further guidance is required.
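
For readers less familiar with the statistic, Cohen's kappa corrects the observed proportion of agreement between two raters (p_o) for the agreement expected by chance from each rater's marginal category frequencies (p_e): kappa = (p_o - p_e) / (1 - p_e). The short sketch below uses hypothetical ratings, not data from the study, to illustrate the calculation. Note that when one rater assigns a single category to every item, kappa collapses to zero regardless of how often the raters agree, and it is undefined when chance agreement is total; this is presumably why the review reports raw percent agreement for such domains (for example, directness) instead.

  # A minimal sketch of Cohen's kappa for two raters (hypothetical data,
  # not taken from the study).
  from collections import Counter
  from math import nan

  def cohen_kappa(rater_a, rater_b):
      """Return Cohen's kappa for two raters' categorical judgments."""
      n = len(rater_a)
      # Observed proportion of agreement.
      p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
      # Chance agreement from each rater's marginal category frequencies.
      freq_a, freq_b = Counter(rater_a), Counter(rater_b)
      p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
      if p_e == 1.0:
          # Degenerate case: both raters use the same single category
          # throughout, so kappa is 0/0 and undefined.
          return nan
      return (p_o - p_e) / (1 - p_e)

  # Hypothetical risk-of-bias judgments for eight outcomes.
  rater_1 = ["no", "serious", "no", "no", "very serious", "serious", "no", "serious"]
  rater_2 = ["no", "no", "no", "serious", "very serious", "serious", "no", "no"]
  print(round(cohen_kappa(rater_1, rater_2), 2))  # ~0.35

An equivalent computation is available as cohen_kappa_score in scikit-learn.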

Highlights

  • The GRADE tool (Grading of Recommendations Assessment, Development and Evaluation) has been developed and refined over recent years by an international working group

  • This study shows variable agreement across the GRADE domains, reflecting areas where further guidance is required

  • GRADE represents an important tool for decision-makers as it provides a mechanism to bridge the gap from evidence synthesis to application of the evidence for informed decision-making

Introduction

The GRADE tool (Grading of Recommendations Assessment, Development and Evaluation) has been developed and refined over recent years by an international working group (www.gradeworkinggroup.org). One of the motivations for developing the tool was to address shortcomings of other approaches to rating the strength or quality of a body of evidence. A series of articles about the GRADE tool has been published in the Journal of Clinical Epidemiology [3,4,5,6,7,8,9,10,11,12]. These reports provide details about the development of the tool and general instructions on how it should be applied. While much has been published about GRADE, there are few empirical and systematic evaluations.
